Prometheus

What is Prometheus

  • Prometheus is a new generation of open source monitoring and alerting solution. In 2016 it became the second project to join the Cloud Native Computing Foundation (CNCF), after Kubernetes (K8s)

Characteristics of Prometheus

Easier to manage

  1. The core of Prometheus is a single binary with no third-party dependencies (databases, caches, etc.). It needs only local disk, so there is no risk of cascading failures
  2. Because Prometheus is based on the pull model, a monitoring system can be set up anywhere (a PC, a development environment, a test environment)
  3. For complex situations, monitoring targets can also be managed dynamically through Prometheus service discovery

Monitor the internal health of the service

  1. Prometheus encourages users to monitor the internal state of their services. Using the rich set of Prometheus client libraries, users can easily add Prometheus support to their applications and so observe the true internal state of services and applications

Powerful data model

  1. All monitoring data collected are stored in the built-in TSDB in the form of metrics. In addition to the basic metric name, all samples contain a set of labels that describe the characteristics of the sample

Powerful query language PromQL

  1. Prometheus provides PromQL, a powerful query language for querying and aggregating monitoring data. PromQL is also used for data visualization (for example, in Grafana) and for alerting

    PromQL makes it easy to answer questions like:

    1. What is the 95th percentile of application latency over time?
    2. How much disk space is expected to be in use 4 hours from now?
    3. Which 5 services have the highest CPU usage (filtering)?

Efficient

  1. For a monitoring system, a large number of monitoring tasks inevitably produces a large amount of data, which Prometheus processes efficiently. A single Prometheus Server instance can handle:

    1. Millions of monitoring metrics
    2. Hundreds of thousands of data points per second

Extensible

  1. An independent Prometheus Server can be run for each team in each data center. Prometheus support for federation lets multiple Prometheus instances form one logical cluster. When a single Prometheus Server instance has too many tasks to handle, it can be scaled out by combining functional partitioning (sharding) with federation

Ease of integration

  1. Monitoring services can be set up quickly with Prometheus and integrate easily into applications. Client SDKs are currently available for Java, JMX, Python, Go, Ruby, .NET, Node.js, and others. With these SDKs, applications can quickly be brought under Prometheus monitoring, or you can develop your own collection of monitoring data
  2. Monitoring data collected by these clients can be used not only by Prometheus but also by other monitoring tools such as Graphite
  3. Prometheus also supports integration with other monitoring systems: Graphite, StatsD, collectd, Scollector, Munin, Nagios, etc. The Prometheus community also provides a large amount of third-party support for collecting monitoring data: JMX, CloudWatch, EC2, MySQL, PostgreSQL, Haskell, Bash, SNMP, Consul, HAProxy, Mesos, Bind, CouchDB, Django, Memcached, RabbitMQ, Redis, etc.

Core components

  • Prometheus Server: primarily fetches and stores time series data, and provides querying and alert rule configuration management
  • Client libraries: instrument application code so that its metrics can be queried by and reported to the Prometheus Server
  • Pushgateway: an aggregation point for batch jobs and short-lived monitoring data, mainly used when jobs report (push) their own data
  • Exporters for all kinds of data, such as node_exporter, which reports machine metrics, and the MongoDB exporter, which reports MongoDB information
  • Alertmanager: manages alert notifications

Architecture

  • The main modules of Prometheus include the Server, Exporters, Pushgateway, PromQL, Alertmanager, Web UI, etc.

    Roughly, it works like this:

    1. Prometheus Server periodically pulls data from targets that are statically configured or found through service discovery
    2. When newly pulled data exceeds the configured in-memory buffer, Prometheus persists the data to disk (or to the cloud, if remote storage is used)
    3. Prometheus can be configured with rules that periodically query the data and, when a condition is triggered, push alerts to the configured Alertmanager
    4. When Alertmanager receives an alert, it can, depending on configuration, aggregate, deduplicate, and reduce noise before finally sending out a notification
    5. Data can be queried and aggregated using the API, the Prometheus console, or Grafana

Storage computing layer

  1. Prometheus Server, which includes a storage engine and a computing engine
  2. The Retrieval component actively pulls metric data from the Pushgateway or Exporters
  3. Service Discovery dynamically discovers the targets to monitor
  4. TSDB: core data storage and querying
  5. HTTP server: provides HTTP services externally

Collection layer

The collection layer is divided into two categories: short-lived jobs and long-lived jobs

  1. Short-lived jobs: push their metrics to the Pushgateway directly through its API when they exit
  2. Long-lived jobs: the Retrieval component pulls data directly from the job or Exporter

The application layer

The application layer is mainly divided into two parts: Alertmanager and data visualization

  1. Alertmanager: can connect to PagerDuty, a paid monitoring and alerting system that can send SMS alerts, follow up with a phone call if there is no ACK within 5 minutes, notify the manager on duty if there is still no ACK…

    It can also send alerts by email

  2. Data visualization: the Prometheus built-in Web UI, Grafana, and other API-based clients

Prometheus configuration

Configuration instructions

After installation, go to the Prometheus directory and view or modify prometheus.yml

  1. global configuration block: controls the global configuration of the Prometheus server

    1. scrape_interval: the interval at which data is pulled. The default is one minute
    2. evaluation_interval: the interval at which rules are evaluated (to generate alerts). The default is one minute
  2. rule_files configuration block: the rule configuration files

  3. scrape_configs configuration block: configures the collection targets, that is, what Prometheus monitors. Prometheus exposes its own runtime information over HTTP, so Prometheus can monitor its own runtime data

    1. job_name: the name of the monitoring job
    2. static_configs: a static target configuration, listing fixed targets to pull data from
    3. targets: the targets to monitor, that is, where data is pulled from. For example, Prometheus would pull data from http://hadoop202:9090/metrics
  4. Prometheus can reload its configuration at runtime if it is started with the following flag: --web.enable-lifecycle

    Configuration example:

    Configure multiple collection modes

Pushgateway

  • Prometheus normally works in pull mode, pulling monitoring data from the jobs that produce metrics or from exporters (for example, an exporter dedicated to monitoring a host). But we are monitoring Flink on YARN jobs, and it would obviously be difficult for Prometheus to automatically discover job submission and termination and to pull data from those jobs

    Pushgateway is a relay component: Flink on YARN jobs are configured to push their metrics to the Pushgateway, and Prometheus then pulls them from there
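To make the push step concrete, here is a minimal Python sketch of how a short-lived job might hand a metric to a Pushgateway over HTTP. The gateway address, job name, metric name, and the helper names build_push_request and push are assumptions made for illustration; a real Flink job would normally rely on its own metrics reporter rather than hand-written code like this.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(gateway, job, metrics, grouping=None):
    """Build (but do not send) a request that pushes metrics to a Pushgateway.

    gateway  -- base URL, e.g. "http://localhost:9091" (assumed address)
    job      -- value of the job grouping label
    metrics  -- dict of metric name -> numeric value
    grouping -- optional extra grouping labels, e.g. {"instance": "app_01"}
    """
    # Grouping key is encoded in the path: /metrics/job/<job>/<label>/<value>...
    # Label values containing "/" need special encoding, not handled here.
    path = "/metrics/job/" + quote(job, safe="")
    for name, value in (grouping or {}).items():
        path += "/" + quote(name, safe="") + "/" + quote(value, safe="")
    # Text exposition format: one "name value" line per sample, newline-terminated.
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    return urllib.request.Request(
        gateway.rstrip("/") + path,
        data=body.encode("utf-8"),
        method="PUT",  # PUT replaces all metrics stored under this grouping key
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def push(request, timeout=5):
    """Send the request; this needs a reachable Pushgateway."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical job and metric names, for illustration only.
req = build_push_request(
    "http://localhost:9091",
    job="flink_job",
    metrics={"records_processed_total": 1024},
    grouping={"instance": "app_01"},
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"), end="")
```

Calling push(req) would perform the actual upload. Prometheus is then configured to scrape the Pushgateway like any other target.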

AlertManager

  • Optional installation

  • Example for configuring alarm rules, alertManager.yaml

Node Exporter

  • In the Prometheus architecture, Prometheus Server is responsible for collecting and storing data and for serving queries, while the actual gathering of monitoring samples is done by Exporters. To monitor something such as the CPU usage of a host, we therefore need an Exporter. Prometheus periodically pulls monitoring samples from the HTTP endpoint (usually /metrics) that the Exporter exposes
  • An Exporter is a fairly open concept: it can be a program that runs independently of the monitoring target, or it can be built directly into the target. All that matters is that it provides monitoring sample data to Prometheus in the standard format
  • To collect host metrics such as CPU, memory, and disk information, we can use Node Exporter, which is also written in Go, has no third-party dependencies, and can be run as soon as it is downloaded and unpacked. The latest Node Exporter binary package is available from prometheus.io/download/
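As a rough illustration of what "providing samples in the standard format over /metrics" means, the following Python sketch serves a couple of made-up metrics in the Prometheus text exposition format. It is not how Node Exporter is implemented. The metric names, the port, and the helpers render_metrics, collect, and serve are all hypothetical, and label value escaping is deliberately left out.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render samples in the Prometheus text exposition format.

    samples -- list of (metric_name, labels_dict, value) tuples.
    Escaping of quotes, backslashes and newlines in label values is omitted.
    """
    lines = []
    for name, labels, value in samples:
        if labels:
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def collect():
    """Hypothetical collector returning two made-up host metrics."""
    return [
        ("demo_cpu_count", {}, os.cpu_count() or 0),
        ("demo_up", {"exporter": "sketch"}, 1),
    ]


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the /metrics path is served, mirroring the usual convention.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Blocks forever; call it manually to try the exporter."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


print(render_metrics([("demo_up", {"exporter": "sketch"}, 1)]), end="")
```

Running serve() and adding the host and port as a target in scrape_configs would let Prometheus pull these samples on every scrape interval.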

PromQL

  • Prometheus defines a unique time series by a metric name and a corresponding set of labels (labelset). The metric name reflects the basic identity of the monitored samples, while the labels provide multiple feature dimensions for the collected data. Based on these dimensions, the data can be filtered, aggregated, and counted to produce new, computed time series. PromQL is the built-in data query language of Prometheus. It provides rich querying, aggregation, and logical computation over time series data, and it is widely used in day-to-day Prometheus work, including data queries, visualization, and alerting. PromQL is the basis of nearly every Prometheus use case, and understanding and mastering it is the first lesson in learning Prometheus

The data model

  • All monitoring data collected by Prometheus is stored in the built-in time series database (TSDB) in the form of time series (streams of timestamped values that share the same metric name and label set). In addition to stored time series, Prometheus may return temporary, derived time series as the result of queries
Metric Name and Label
  • Each time series is uniquely identified by a metric name and a set of labels (key-value pairs). The metric name reflects the meaning of the monitored samples (for example, http_requests_total is the total number of HTTP requests received by the current system)

    Labels give Prometheus a powerful multidimensional data model: for the same metric name, different label combinations identify specific dimensional instances of that metric (for example, all HTTP requests to the /api/tracks handler that carry the label method=POST). Changing any label value, including adding or removing a label, creates a new time series

Samples
  • Samples form the actual time series data. Each sample consists of the following parts:

    • Timestamp: A timestamp accurate to the millisecond
    • Sample value: A float64 representation of the current sample value
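The identity rule above (metric name plus full label set) can be sketched in a few lines of Python. This is a toy in-memory model for illustration only, not the way the Prometheus TSDB stores data; TinyStore and series_key are hypothetical names.

```python
from collections import defaultdict


def series_key(name, labels):
    """Identity of a time series: the metric name plus its full label set."""
    return (name, frozenset(labels.items()))


class TinyStore:
    """Toy in-memory stand-in for a TSDB (illustration only)."""

    def __init__(self):
        # series key -> list of (timestamp in ms, float64-like value)
        self.series = defaultdict(list)

    def append(self, name, labels, timestamp_ms, value):
        key = series_key(name, labels)
        self.series[key].append((int(timestamp_ms), float(value)))


store = TinyStore()
# Same name and same labels (in any order) -> same series, two samples.
store.append("http_requests_total", {"method": "POST", "handler": "/api/tracks"}, 1000, 3)
store.append("http_requests_total", {"handler": "/api/tracks", "method": "POST"}, 2000, 5)
# Changing one label value creates a new series.
store.append("http_requests_total", {"method": "GET", "handler": "/api/tracks"}, 2000, 7)

print(len(store.series))
```

The first two appends land in one series because label order does not matter, while the third creates a second series because the method label differs.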

Basic usage

Query time series
  • After Prometheus has collected sample data for the monitored metrics through an Exporter, we can query that data with PromQL

    Querying directly by metric name returns all time series of that metric

    prometheus_http_requests_total

  • PromQL also lets users filter time series by label matching. PromQL currently supports two main matching modes: exact matching and regular expression matching

    1. PromQL supports two exact-match operators, = and !=

      1. Use label=value to select the time series whose label has exactly that value
      2. Conversely, use label!=value to exclude the time series whose label has that value

      Filter examples:

      prometheus_http_requests_total{code="200"}
      prometheus_http_requests_total{code!="200"}

    2. PromQL also supports regular expressions as matching conditions. Multiple alternatives are separated with |

      1. Use label=~regex to select the time series whose label value matches the regular expression
      2. Conversely, use label!~regex to exclude them
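The four matching operators can be modelled with a short Python sketch. It assumes regular expressions are fully anchored (the whole label value must match) and that a missing label behaves like an empty value, which is how Prometheus documents matcher behavior. Python's re module stands in for the RE2 syntax Prometheus uses, and the sample series are made up.

```python
import re


def matches(labels, matcher):
    """Evaluate one (label_name, operator, value) matcher against a label set."""
    name, op, value = matcher
    actual = labels.get(name, "")  # a missing label behaves like an empty value
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        return re.fullmatch(value, actual) is not None  # fully anchored match
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown operator: {op}")


def select(series, matchers):
    """Keep the label sets that satisfy every matcher."""
    return [labels for labels in series if all(matches(labels, m) for m in matchers)]


# Hypothetical label sets of prometheus_http_requests_total.
series = [
    {"code": "200", "handler": "/metrics"},
    {"code": "302", "handler": "/"},
    {"code": "400", "handler": "/api/v1/query"},
]

print(select(series, [("code", "=~", "200|302")]))
print(select(series, [("code", "!=", "200")]))
```

Because matching is anchored, a pattern such as 20 selects nothing here, while 20. would select the series with code 200.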
Range queries
  • When a time series is queried directly with a PromQL expression such as http_requests_total, the result contains only the latest sample value of each series. Such a result is called an instantaneous vector (an instant vector in the Prometheus documentation), and the corresponding expression is called an instantaneous vector expression

    To get the sample data over a period of time, we need an interval vector expression (a range vector in the Prometheus documentation). The difference from an instantaneous vector expression is that an interval vector expression must specify a time range, which is written with the range selector []

    For example, you can select all sample data within the last 5 minutes by using the following expression

    prometheus_http_requests_total{}[5m]

    The result of an interval vector expression is called an interval vector. Besides m for minutes, the PromQL range selector supports other units of time:

    • s: seconds
    • m: minutes
    • h: hours
    • d: days
Time shift operation
  • Instantaneous vector expressions and interval vector expressions both use the current time as their reference point:

    prometheus_http_requests_total{} : instantaneous vector expression, selects the latest sample at the current time

    prometheus_http_requests_total{}[5m] : interval vector expression, selects the samples from the 5 minutes before the current time

  • What if we want the instantaneous sample data from five minutes ago, or the sample data from yesterday? For this we can use a time offset, written with the keyword offset:

    prometheus_http_requests_total{} offset 5m
    prometheus_http_requests_total{} offset 1d
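How the three selection styles relate can be sketched in Python on a single series of (timestamp, value) samples. This is a simplified model: timestamps are in seconds, the range is treated as left-open, and the 300 second lookback for instantaneous selection mirrors the default lookback window in Prometheus. The function names range_select and instant_select are hypothetical.

```python
def range_select(samples, now, window, offset=0):
    """Samples with a timestamp in (now - offset - window, now - offset]."""
    end = now - offset
    start = end - window
    return [(t, v) for t, v in samples if start < t <= end]


def instant_select(samples, now, offset=0, lookback=300):
    """Latest sample at or before now - offset, if one exists within the lookback."""
    candidates = range_select(samples, now, lookback, offset)
    return candidates[-1] if candidates else None


# One sample per minute from t=0 to t=600; the value is the minute number.
samples = [(t, float(t // 60)) for t in range(0, 601, 60)]
now = 600

print(instant_select(samples, now))              # like: metric{}
print(range_select(samples, now, 300))           # like: metric{}[5m]
print(instant_select(samples, now, offset=300))  # like: metric{} offset 5m
```

An offset simply moves the reference time back before the usual selection is applied, so it combines with both instantaneous and interval selection.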

Using aggregate operations
  • In general, if the labels describing the characteristics of a sample are not unique, a PromQL query returns multiple time series that satisfy those feature dimensions

    PromQL provides aggregation operations that process these time series into new time series:

    # Total number of HTTP requests across all series
    sum(prometheus_http_requests_total)
    # Average CPU time, grouped by mode
    avg(node_cpu_seconds_total) by (mode)
    # CPU usage of each host (instance)
    sum(irate(node_cpu_seconds_total{mode!='idle'}[5m])) by (instance) / sum(irate(node_cpu_seconds_total[5m])) by (instance)

Scalars and Strings
  • In addition to instantaneous vector expressions and interval vector expressions, PromQL also supports scalars and strings.

    1. Scalar: a floating-point numeric value

      A scalar is just a single number, with no time series attached

      Note that the expression count(prometheus_http_requests_total) still returns an instantaneous vector. An instantaneous vector with a single element can be converted to a scalar with the built-in function scalar()

    2. String: a simple string value

      A string used directly as a PromQL expression returns that string

A valid PromQL expression
  • Every PromQL expression must contain at least one metric name (e.g. http_requests_total), or at least one label matcher that does not match the empty string (e.g. {code="200"}).

    Therefore, the following expression is valid:

    prometheus_http_requests_total{} # valid

    The following expression is not valid:

    {job=~".*"} # invalid

    In addition to the {label=value} form, we can also use the built-in __name__ label to specify the metric name

    {__name__=~"prometheus_http_requests_total"} # valid
    {__name__=~"node_disk_bytes_read|node_disk_bytes_written"} # valid

PromQL operator

  • In addition to making it easy to query and filter time series, PromQL also supports a rich set of operators for further processing of time series: mathematical operators, logical (set) operators, Boolean operators, and so on
Mathematical operations

PromQL supports the following mathematical operators:

  • + (addition)
  • - (subtraction)
  • * (multiplication)
  • / (division)
  • % (remainder)
  • ^ (power)
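Boolean operations
  1. Prometheus supports the following Boolean (comparison) operators:

    • == (equal)
    • != (not equal)
    • > (greater than)
    • < (less than)
    • >= (greater than or equal)
    • <= (less than or equal)

  2. Use the bool modifier to change the behavior of the Boolean operators

    By default, Boolean operators filter time series data: samples for which the comparison is false are dropped. In other cases we may want an actual Boolean result, and with the bool modifier the operator returns 1 for true and 0 for false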
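The difference between filtering and the bool modifier can be shown with a small Python sketch that compares an instantaneous vector against a scalar. The vector is modelled as a list of (labels, value) pairs with made-up data, and compare_scalar is a hypothetical helper, not a Prometheus API.

```python
import operator

COMPARE = {
    "==": operator.eq, "!=": operator.ne,
    ">": operator.gt, "<": operator.lt,
    ">=": operator.ge, "<=": operator.le,
}


def compare_scalar(vector, op, scalar, bool_modifier=False):
    """Compare every sample of a vector with a scalar.

    Without bool: keep only the samples for which the comparison holds.
    With bool: keep every sample, replacing its value by 1.0 or 0.0.
    """
    result = []
    for labels, value in vector:
        hit = COMPARE[op](value, scalar)
        if bool_modifier:
            result.append((labels, 1.0 if hit else 0.0))
        elif hit:
            result.append((labels, value))
    return result


# Made-up request counts per status code.
vector = [({"code": "200"}, 1500.0), ({"code": "500"}, 3.0)]

print(compare_scalar(vector, ">", 1000))                      # like: metric > 1000
print(compare_scalar(vector, ">", 1000, bool_modifier=True))  # like: metric > bool 1000
```

The filtered form is the usual choice for alert conditions, while the bool form is convenient when the 0 or 1 result feeds into further arithmetic.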

Set operations
  • An instantaneous vector expression yields a set of time series, which is called an instantaneous vector. Set operators perform set operations between two instantaneous vectors

    Currently, Prometheus supports the following set operators:

    • and (intersection)
    • or (union)
    • unless (complement)

    vector1 and vector2 produces a new vector consisting of the elements of vector1 whose label sets exactly match an element of vector2

    vector1 or vector2 produces a new vector that contains all the samples in vector1 plus any samples in vector2 whose label sets do not match an element of vector1

    vector1 unless vector2 produces a new vector consisting of the elements of vector1 that have no element with a matching label set in vector2
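The set operators can be sketched in Python by modelling each instantaneous vector as a dict that maps a label set to a value. The helper names and the sample data are hypothetical, and the metric name is left out of the keys because matching is done on labels.

```python
def lbl(**labels):
    """Hashable label set used as the matching key."""
    return frozenset(labels.items())


def v_and(v1, v2):
    """Elements of v1 whose label set also appears in v2 (values come from v1)."""
    return {k: val for k, val in v1.items() if k in v2}


def v_or(v1, v2):
    """All of v1, plus the elements of v2 whose label set is not in v1."""
    out = dict(v1)
    for k, val in v2.items():
        out.setdefault(k, val)
    return out


def v_unless(v1, v2):
    """Elements of v1 whose label set does not appear in v2."""
    return {k: val for k, val in v1.items() if k not in v2}


vector1 = {lbl(instance="a"): 1.0, lbl(instance="b"): 2.0}
vector2 = {lbl(instance="b"): 20.0, lbl(instance="c"): 30.0}

print(v_and(vector1, vector2))
print(v_or(vector1, vector2))
print(v_unless(vector1, vector2))
```

In all three cases, values on matching label sets are taken from the left-hand vector; the right-hand vector only decides which elements are kept or added.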

Operator priority
  • For complex expressions, you need to know the precedence of the operators. For example, to query the CPU usage of a host, use the following expression:

    100 * (1 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (job))

  • In Prometheus, the order of precedence of binary operators, from high to low, is:

    1. ^
    2. *, /, %
    3. +, -
    4. ==, !=, <=, <, >=, >
    5. and, unless
    6. or

Aggregation operations
  • Prometheus also provides the following built-in aggregation operators. They work only on instantaneous vectors, aggregating the samples returned by an instantaneous vector expression into new time series

    • sum (sum)
    • min (minimum)
    • max (maximum)
    • avg (average)
    • stddev (standard deviation)
    • stdvar (variance)
    • count (count of elements)
    • count_values (count of elements with the same value)
    • bottomk (the k elements with the smallest values)
    • topk (the k elements with the largest values)
    • quantile (quantile of the distribution)
  • without removes the listed labels from the result and keeps all the others. by does the opposite: only the listed labels are kept in the result vector and the rest are removed. With without and by, data can be aggregated according to the labels of the samples

    For example:

    sum(prometheus_http_requests_total) without(instance,pod,service,namespace)

    is equivalent to the following, assuming the series carry exactly these eight labels:

    sum(prometheus_http_requests_total) by (code,endpoint,handler,job)
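The relationship between by and without can be checked with a short Python sketch of a sum aggregation. The vector is a list of (labels, value) pairs with invented data, and aggregate_sum is a hypothetical helper written for this illustration.

```python
from collections import defaultdict


def aggregate_sum(vector, by=None, without=None):
    """Sum sample values per group.

    by      -- keep only these labels in the grouping key
    without -- drop these labels and keep all the others
    """
    groups = defaultdict(float)
    for labels, value in vector:
        if by is not None:
            kept = {k: v for k, v in labels.items() if k in by}
        else:
            kept = {k: v for k, v in labels.items() if k not in (without or ())}
        groups[frozenset(kept.items())] += value
    return dict(groups)


# Invented samples: two instances serving the same handler.
vector = [
    ({"code": "200", "handler": "/metrics", "instance": "a:9090", "job": "prometheus"}, 10.0),
    ({"code": "200", "handler": "/metrics", "instance": "b:9090", "job": "prometheus"}, 5.0),
    ({"code": "500", "handler": "/metrics", "instance": "a:9090", "job": "prometheus"}, 1.0),
]

by_result = aggregate_sum(vector, by={"code", "handler", "job"})
without_result = aggregate_sum(vector, without={"instance"})
print(by_result == without_result)
```

The two results agree only because every series here carries exactly the labels code, handler, instance, and job. If some series had an extra label, without would keep it in the grouping key while by would drop it.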

  • count_values counts how many times each sample value occurs. It outputs one time series per unique sample value, and each of these series carries an additional label holding that value. For example:

    count_values("count",prometheus_http_requests_total)

  • topk and bottomk sort the samples by value and return the time series with the n largest or n smallest current values

    To get the 5 time series with the highest HTTP request counts, use the expression:

    topk(5,prometheus_http_requests_total)
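A minimal Python sketch of the topk and bottomk idea, using a list of (handler, value) pairs with invented numbers. Unlike the other aggregations, the result still consists of the original series rather than one value per group.

```python
import heapq


def topk(k, vector):
    """The k samples with the largest values, largest first."""
    return heapq.nlargest(k, vector, key=lambda item: item[1])


def bottomk(k, vector):
    """The k samples with the smallest values, smallest first."""
    return heapq.nsmallest(k, vector, key=lambda item: item[1])


# Invented request counts per handler.
vector = [("/metrics", 120.0), ("/api/v1/query", 45.0), ("/", 3.0), ("/graph", 60.0)]

print(topk(2, vector))
print(bottomk(1, vector))
```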

  • quantile computes a quantile of the distribution of the current sample values, in the form quantile(x, expression), where 0<=x<=1

    For example, x = 0.5 gives the median of the current sample values:

    quantile(0.5, prometheus_http_requests_total)
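To show what the quantile aggregation computes, here is a Python sketch that sorts the current sample values and interpolates linearly between the two nearest ranks. Treat the interpolation rule and the handling of out-of-range x as assumptions of this sketch rather than a statement of the exact Prometheus implementation.

```python
import math


def quantile(x, values):
    """Quantile of a list of sample values, with linear interpolation.

    Assumed edge cases: an empty input gives NaN, x below 0 gives -inf,
    and x above 1 gives +inf.
    """
    if not values:
        return math.nan
    if x < 0:
        return -math.inf
    if x > 1:
        return math.inf
    ordered = sorted(values)
    rank = x * (len(ordered) - 1)            # fractional position in sorted order
    lower = math.floor(rank)
    upper = min(len(ordered) - 1, lower + 1)
    weight = rank - lower                    # share taken from the upper neighbor
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


print(quantile(0.5, [1.0, 2.0, 3.0, 4.0]))
print(quantile(0.5, [3.0, 1.0, 2.0]))
```

With an even number of samples the median falls halfway between the two middle values, and with an odd number it is the middle value itself.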