Preface

Software development is not only about implementing business requirements. The program also has to keep running for as long as possible, which is a matter of service stability. Stability depends on many factors, in both hardware and software. To keep those factors in good shape, we need to continuously collect, analyze, and monitor data, and optimize wherever we can. Prometheus offers a good monitoring solution for this.

What is Prometheus?

Prometheus is an open source monitoring and alerting system that pulls metric values of interest and stores them as time series data. If collection were all we needed, we could get by with MySQL, Redis, or a similar store. However, monitoring data is generated continuously and at large scale, so how it is stored needs careful thought. In addition, most monitoring data is statistical in nature, such as the distribution of values over time, which calls for specialized knowledge of measurement. This is where Prometheus excels.

Because Prometheus deals only in metric values and timestamps, it is very cheap for external programs to integrate with it. This ease of use lets us observe and analyze data along multiple dimensions, which makes monitoring more specific: memory consumption, network utilization, connection requests, and so on.

Prometheus also features easy-to-use, extensible data collection and a powerful query language, which help it alert quickly and locate faults when they occur. As a result, much microservice and cloud native infrastructure, such as Kubernetes (K8s), now integrates with Prometheus.

The overall architecture of Prometheus

Prometheus provides a range of ecosystem components in addition to its core server to ensure scalability and reliability. To keep things simple, start with a bird's-eye view of the core Prometheus server.

The processing pipeline of the Prometheus server is clear: data collection, data storage, and data query. Of course, a complete system needs further components to support its features, so the Prometheus architecture also includes components such as:

  • Pushgateway: accepts metrics pushed by monitored nodes; the Prometheus server then pulls them centrally from the Pushgateway.
  • Service discovery (target discovery): obtains the addresses of the monitored nodes through service discovery.
  • PromQL: a language for querying metric data, similar in role to SQL.
  • Alertmanager: receives the alerts that the Prometheus server raises from its configured alerting rules and takes care of sending notifications.
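
The collect, store, query pipeline described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the target is a plain function standing in for an HTTP GET of a /metrics endpoint, the "database" is a dictionary, and the names scrape, store, query_latest, and demo_requests_total are hypothetical, not part of Prometheus.

```python
def scrape(target):
    """Collect: call the target (a stand-in for an HTTP GET of /metrics)."""
    return target()

def store(tsdb, samples, timestamp):
    """Store: append (timestamp, value) to the series with the given name."""
    for name, value in samples.items():
        tsdb.setdefault(name, []).append((timestamp, value))

def query_latest(tsdb, name):
    """Query: return the most recent value of a series, or None if unknown."""
    series = tsdb.get(name)
    return series[-1][1] if series else None

# Hypothetical target that reports one counter value per scrape.
requests_seen = iter([3, 7, 12])

def demo_target():
    return {"demo_requests_total": next(requests_seen)}

tsdb = {}
for t in (0, 15, 30):  # one scrape every 15 "seconds"
    store(tsdb, scrape(demo_target), t)

print(query_latest(tsdb, "demo_requests_total"))  # latest scraped value
```

The point of the sketch is the direction of the data flow: the server decides when to pull, and the target only has to expose its current values.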

Finally, the overall architecture of Prometheus is as follows:

Metrics

Metrics, as mentioned above, are the focus of Prometheus. A metric is simply a measured quantity such as CPU load, memory usage, or the number of connection requests. Metrics come from operating systems, application services, devices, and so on, and they vary over time. To make them easier to measure, Prometheus provides four metric types:

  • Counter: a value that only increases and never decreases.
  • Gauge: a value that can go up or down arbitrarily.
  • Histogram: samples observations into buckets and keeps their sum and count, so that quantiles and averages can be calculated; for example, how many requests took between 0 and 10 ms and how many took between 10 and 20 ms.
  • Summary: a histogram only does simple bucketing and bucket counting on the client side, so the percentiles that the Prometheus server estimates from such limited data are not very accurate. The Summary type addresses this accuracy problem by calculating percentiles on the client side.
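
To make the first three types concrete, here is a minimal Python sketch of their behavior. It is not the Prometheus client library: the classes and their methods are hypothetical stand-ins, and Summary is left out. The histogram keeps per-bucket counts to match the 0 to 10 ms and 10 to 20 ms example, and also shows the running totals that answer "how many observations were at or below each bound".

```python
import bisect
from itertools import accumulate

class Counter:
    """A value that can only go up."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.value += amount

class Gauge:
    """A value that can be set, raised, or lowered freely."""
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations per bucket and tracks their sum and count."""
    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot at the end for values above every bound (+Inf).
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left returns the index of the first bound >= value.
        self.bucket_counts[bisect.bisect_left(self.upper_bounds, value)] += 1
        self.total += value
        self.count += 1

    def cumulative_counts(self):
        # Running totals: how many observations were <= each bound.
        return list(accumulate(self.bucket_counts))

latency = Histogram([10, 20])          # bucket bounds in milliseconds
for ms in (3, 8, 10, 14, 25):
    latency.observe(ms)

print(latency.bucket_counts)           # per-bucket counts
print(latency.cumulative_counts())     # running totals
print(latency.total / latency.count)   # average latency
```

Because only bucket counts, a sum, and a count are kept, any percentile derived from them is an estimate, which is the limitation the Summary type is meant to address.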

In fact, a metric in Prometheus consists of a metric name, labels, and a value. (Labels are what we usually call dimensions.) For example, here is an HTTP request metric of the counter type:

# HELP prometheus_http_requests_total Counter of HTTP requests.
# TYPE prometheus_http_requests_total counter
prometheus_http_requests_total{code="200",handler="/api/v1/label/:name/values"} 7
prometheus_http_requests_total{code="200",handler="/api/v1/query"} 19
prometheus_http_requests_total{code="200",handler="/api/v1/query_range"} 27
prometheus_http_requests_total{code="200",handler="/graph"} 11
prometheus_http_requests_total{code="200",handler="/metrics"} 8929
prometheus_http_requests_total{code="200",handler="/static/*filepath"} 52
prometheus_http_requests_total{code="302",handler="/"} 1
prometheus_http_requests_total{code="400",handler="/api/v1/query_range"} 6
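
As a rough illustration of how a name, a set of labels, and a value combine into one of the lines above, the following Python sketch renders a single sample. The function format_sample is hypothetical, and it skips the escaping that the real text format applies to special characters in label values.

```python
def format_sample(name, labels, value):
    """Render one sample as name{label="value",...} value."""
    if not labels:
        return f"{name} {value}"
    # Sort labels so the output is stable; escaping is deliberately omitted.
    pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"{name}{{{pairs}}} {value}"

line = format_sample(
    "prometheus_http_requests_total",
    {"code": "200", "handler": "/graph"},
    11,
)
print(line)
```

Each distinct combination of label values is its own time series, which is why the same metric name appears on several lines in the output above.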

Note that Prometheus accumulates data continuously, so it is not intended for long-term retention of metrics; by default it keeps 15 days of data. To handle the faults that monitoring data reveals quickly, configure alerting rules so that they are detected promptly.

Prometheus configuration

Many detailed tutorials cover how to use Prometheus, so that is not repeated here. Instead, take a look at its key configuration file, prometheus.yml:

global:
  scrape_interval:     15s
  evaluation_interval: 15s

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

It has three parts: global configuration (such as the scrape interval), alerting rule files, and the monitored targets. Alerting rules are trigger conditions written as PromQL expressions, for example:

groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Instance has been down for more than 1 minute
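
The effect of the for clause can be modeled with a small Python simulation: the alert stays pending while up == 0 holds and only starts firing once the condition has held for the configured duration. This is a simplified model rather than Prometheus's actual rule evaluator, and the function alert_states and its parameters are hypothetical.

```python
def alert_states(up_samples, for_seconds, interval_seconds):
    """Return the alert state after each evaluation of `up == 0`."""
    pending_since = None
    states = []
    for i, up in enumerate(up_samples):
        now = i * interval_seconds
        if up != 0:
            # Condition no longer holds: the alert resets.
            pending_since = None
            states.append("inactive")
            continue
        if pending_since is None:
            pending_since = now  # first evaluation where the target is down
        held_for = now - pending_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states

# Target is up, then down for three evaluations, then up again.
states = alert_states([1, 0, 0, 0, 1], for_seconds=60, interval_seconds=30)
print(states)
```

The pending phase filters out short blips: a target that recovers before the duration has elapsed never produces a notification.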

PromQL

PromQL is the query language built into Prometheus. Like SQL in MySQL, it provides rich query capabilities, which are used for filtering in dashboard panels and for the expressions in alerting rules. Here is a short primer on PromQL.

PromQL queries metrics. As described earlier, a metric consists of a metric name, labels, and a value, so to query a metric, open http://localhost:9090/graph in a browser and enter an expression such as:

prometheus_http_requests_total

You can then see the values of the matching series.

The result of a query expression like the one above is called an instant vector: for each combination of metric and labels, the result contains only a single value. To query over a time range, use a range vector expression and specify the duration in []. For example:

prometheus_http_requests_total[1m]

This returns the samples recorded within the last minute.
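
The difference between the two result types can be sketched in Python over one series of (timestamp, value) samples. The helper names are hypothetical, and the sketch ignores details of real PromQL evaluation such as staleness handling; the exact treatment of the window boundaries in Prometheus may also differ from the left-open window assumed here.

```python
# One series, scraped every 15 seconds: (timestamp_seconds, value).
samples = [(0, 5.0), (15, 7.0), (30, 8.0), (45, 12.0), (60, 13.0), (75, 19.0)]

def instant_value(series, at):
    """Most recent sample at or before `at`: a single value per series."""
    eligible = [s for s in series if s[0] <= at]
    return eligible[-1] if eligible else None

def range_values(series, at, window):
    """All samples in the window ending at `at`: many values per series."""
    return [s for s in series if at - window < s[0] <= at]

print(instant_value(samples, 80))     # one sample
print(range_values(samples, 75, 60))  # every sample from the last minute
```

Functions that compute a change over time need the second form, because a single sample carries no information about how the value has moved.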

In addition to instant vectors and range vectors, PromQL expressions can also evaluate to a scalar (a single floating-point value) or a string, which lets us do more kinds of operations.

Operators

PromQL allows us to perform calculations on metric results, such as:

  • Arithmetic operators: + (addition), - (subtraction), * (multiplication), / (division), % (modulo), ^ (power).
  • Comparison operators: > (greater than), < (less than), == (equal), != (not equal), >= (greater than or equal to), <= (less than or equal to)
  • Logical operators: and, or, unless
  • Aggregation operators: sum (sum), min (minimum), avg (average), count (number of elements), stddev (population standard deviation over dimensions), stdvar (population variance over dimensions), etc.

With these operators, we can work with metric values much more flexibly.
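
As an illustration of what an aggregation does, the Python sketch below mimics a PromQL expression such as sum by (code) (prometheus_http_requests_total): series that share the same value of the grouping label are collapsed into one total. The function sum_by and the sample data layout are assumptions made for this example.

```python
from collections import defaultdict

def sum_by(vector, label):
    """Add up sample values, grouped by the value of one label."""
    totals = defaultdict(float)
    for labels, value in vector:
        totals[labels.get(label, "")] += value
    return dict(totals)

# An instant vector as (labels, value) pairs, taken from the earlier output.
vector = [
    ({"code": "200", "handler": "/graph"}, 11.0),
    ({"code": "200", "handler": "/metrics"}, 8929.0),
    ({"code": "400", "handler": "/api/v1/query_range"}, 6.0),
]

print(sum_by(vector, "code"))
```

Grouping by code discards the handler label, so the result answers "how many requests per status code" regardless of which endpoint served them.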

Data filtering

We can also filter data. PromQL has two main kinds of filter expression:

  • Exact match: using = and !=
  • Regular expression match: using =~ and !~ for positive and negative matching

For example, process_cpu_seconds_total{job="Node Exporter"} selects the series whose job label equals "Node Exporter".
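
The four matchers can be sketched in Python as follows. Two assumptions are worth flagging: Python's re module stands in for the regular expression engine that Prometheus uses, which is fine for simple patterns but not identical in syntax, and the pattern is matched against the whole label value. The names MATCHERS and select are hypothetical.

```python
import re

# Each matcher takes the actual label value and the operand from the query.
MATCHERS = {
    "=": lambda actual, expected: actual == expected,
    "!=": lambda actual, expected: actual != expected,
    "=~": lambda actual, pattern: re.fullmatch(pattern, actual) is not None,
    "!~": lambda actual, pattern: re.fullmatch(pattern, actual) is None,
}

def select(series_list, label, op, operand):
    """Keep the label sets whose `label` satisfies the matcher."""
    return [
        labels for labels in series_list
        if MATCHERS[op](labels.get(label, ""), operand)
    ]

series = [{"job": "Node Exporter"}, {"job": "prometheus"}, {"job": "node"}]

print(select(series, "job", "=", "Node Exporter"))
print(select(series, "job", "=~", "node|prometheus"))
print(select(series, "job", "!~", "node.*"))
```

Matching is case-sensitive in this sketch, so "Node Exporter" is not selected by the pattern node.* and therefore survives the negative match in the last line.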

Data storage

Prometheus 2.x stores its time series database on local disk by default, although data can also be sent to third-party storage services.

Local storage

Prometheus groups the samples generated in each two-hour window into a block. Each block is a separate directory containing all the sample data (chunks), a metadata file (meta.json), and an index file (index) for that time window.
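
A small Python sketch shows how samples could be assigned to two-hour blocks by timestamp. It assumes that block windows are aligned to multiples of two hours counted from timestamp zero, which is an assumption made for illustration; block_range and group_into_blocks are hypothetical helpers, not Prometheus functions.

```python
BLOCK_SECONDS = 2 * 60 * 60  # two-hour window

def block_range(timestamp):
    """Return (start, end) of the window containing `timestamp`; end is exclusive."""
    start = timestamp - timestamp % BLOCK_SECONDS
    return start, start + BLOCK_SECONDS

def group_into_blocks(samples):
    """Group (timestamp, value) samples by the window they fall into."""
    blocks = {}
    for timestamp, value in samples:
        blocks.setdefault(block_range(timestamp), []).append((timestamp, value))
    return blocks

blocks = group_into_blocks([(10, 1.0), (7199, 2.0), (7200, 3.0), (7500, 4.0)])
for window, contents in blocks.items():
    print(window, contents)
```

Keeping each window self-contained is what allows a whole block to be written, queried, or deleted as a unit.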

The index file maps metric names and labels to the time series in the sample data. If a time series is deleted through the API, the deletion records are stored in separate tombstone files.

The block for the most recent samples is kept in memory and is not yet fully persisted to disk. So that data can be recovered if Prometheus crashes or restarts, incoming samples are recorded in a write-ahead log (WAL), which is replayed when Prometheus starts. The write-ahead log files are stored in the wal directory in segments of 128MB. WAL files contain raw data that has not yet been compacted, so they are much larger than regular block files. Prometheus keeps at least three WAL files, and high-load servers may keep more than three in order to retain at least two hours of raw data.
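
The write-ahead idea can be sketched in Python: every sample is appended to a log file on disk before it counts as accepted, and after a restart the in-memory state is rebuilt by replaying that file. This is only a conceptual model. The real WAL is a segmented binary format, whereas the TinyWAL class below is hypothetical and writes one JSON line per sample.

```python
import json
import os
import tempfile

class TinyWAL:
    """Append-only log of samples that can rebuild in-memory state."""
    def __init__(self, path):
        self.path = path

    def append(self, series, timestamp, value):
        # Write and sync the record before treating the sample as accepted.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps([series, timestamp, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # Rebuild the in-memory series by reading every logged record.
        head = {}
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as f:
                for line in f:
                    series, timestamp, value = json.loads(line)
                    head.setdefault(series, []).append((timestamp, value))
        return head

with tempfile.TemporaryDirectory() as directory:
    wal = TinyWAL(os.path.join(directory, "wal.log"))
    wal.append("up", 0, 1.0)
    wal.append("up", 15, 0.0)
    # Simulate a restart: a fresh object recovers state from the file alone.
    recovered = TinyWAL(wal.path).replay()

print(recovered)
```

Because the log stores every raw sample in arrival order, it grows quickly, which matches the note above that WAL files are much larger than compacted blocks.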

Remote storage

Prometheus's local storage is limited to a single node, which constrains its scalability and durability. Rather than providing a clustered storage solution itself, Prometheus provides a set of interfaces for integrating with remote storage systems. For example, when a remote write URL is configured in prometheus.yml, Prometheus sends the collected samples over HTTP to an adapter, which passes them on to an external service. The external service can be an actual storage system, cloud storage, a message queue, and so on.

Remote read works similarly: when a remote read URL is configured in the configuration file, Prometheus queries the adapter over HTTP. The adapter then fetches the data from the third-party storage service and returns it.

Shortcomings of Prometheus

Since Prometheus is built around metrics, it cannot be used to trace an individual request along its call chain. And because its data is time series data, building reports from it is difficult.

Additionally, because Prometheus was designed for simplicity and extensibility, it does little in the way of distributed storage, clustering, and multi-tenancy, focusing instead on real-time monitoring.

Conclusion

System monitoring is a key consideration in every mature architecture and an important part of the infrastructure, because it lets us detect problems early and resolve them. Prometheus, a popular open source monitoring system, is becoming a standard, so it is worth getting familiar with it and using it in our own development work to help keep the business stable.

