This article introduces Prometheus's official high-availability scheme and a self-built high-availability scheme.

I. A practical small-scale high-availability solution

The official Prometheus documentation offers only one solution for making Prometheus highly available, described below:

Two Prometheus hosts monitor the same targets. When an alert fires, both hosts send the same alert to Alertmanager, and Alertmanager's alert-deduplication feature ensures that only one notification is emitted. This yields a highly available Prometheus architecture. On top of it, we can also use Keepalived for active/standby failover, with Grafana connecting through a VIP, to build a complete highly available Prometheus monitoring architecture with a web interface for displaying alerts.
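The core of this scheme is simply running two Prometheus instances with an identical configuration that points at the same Alertmanager; since both instances evaluate the same rules against the same targets, they fire alerts with identical label sets, which Alertmanager deduplicates. A minimal sketch of such a shared configuration (the host names and rule file are placeholders, not from the original article):

```yaml
# prometheus.yml — deployed identically on both Prometheus hosts
global:
  scrape_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        # Both instances send alerts to the same Alertmanager,
        # which deduplicates alerts with identical label sets.
        - targets: ['alertmanager:9093']

rule_files:
  - 'alert_rules.yml'   # identical rules, so both hosts fire identical alerts

scrape_configs:
  - job_name: 'node'
    static_configs:
      # Both hosts scrape the same targets.
      - targets: ['node1:9100', 'node2:9100']
```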

Based on the information gathered, the table below shows the relationship between the number of targets a Prometheus host monitors and the memory and disk size it requires.

According to the data in the table, two Prometheus hosts with 8 GB of memory and 100 GB of disk, deployed as an active/standby pair, can monitor an infrastructure of fewer than 500 nodes. Since the scrape interval and the data-retention time directly determine memory and disk usage, these two settings can be tuned to bring memory and disk consumption down to appropriate values.
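The two tuning knobs mentioned above map to the scrape interval in `prometheus.yml` and to Prometheus's retention launch flags. A sketch of how the retention side might be set (the specific values here are illustrative, not recommendations from the original article):

```shell
# Retention is controlled at launch time; scrape_interval lives in prometheus.yml.
# Shorter retention and a size cap keep disk usage within the 100 GB host budget.
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=90GB
```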

II. High-availability scheme for large-scale monitoring

According to the official documentation, Prometheus supports large-scale target monitoring through **federation**, a mechanism by which one aggregating Prometheus host scrapes selected data from other Prometheus hosts. The volume of data collected from the other hosts is too large to be stored locally for long, so we use Prometheus's remote read/write interface to store the data in a third-party database. The aggregating Prometheus host is likewise deployed as an active/standby pair for high availability, but an adapter tool is required to switch between the active and standby writers, as shown in the figure below.
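Federation is configured as an ordinary scrape job against the lower-level hosts' `/federate` endpoint, with `match[]` selectors choosing which series to pull. A minimal sketch for the aggregating host (the job selector and host names are assumed examples):

```yaml
# On the aggregating Prometheus host: scrape selected series
# from the lower-level Prometheus hosts via their /federate endpoints.
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true        # keep the original labels from the source hosts
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'      # example selector: pull only the "node" job's series
    static_configs:
      - targets:
          - 'prom-a:9090'     # assumed names of the lower-level Prometheus hosts
          - 'prom-b:9090'
```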

The third-party storage is PostgreSQL + TimescaleDB, and the adapter is prometheus-postgresql-adapter, developed by Timescale. Once Prometheus and the adapters are configured, if the leader adapter stops receiving data from its Prometheus host for a long time, it releases its lock and the standby adapter takes over, sending data from its own Prometheus host to the third-party storage. In other words, both Prometheus hosts collect the same data in real time, but only the data from the host whose adapter currently holds the leader lock is written to the third-party storage. The complete architecture diagram is shown below.
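Each Prometheus host sends its samples to its local adapter through the standard remote read/write interface. A sketch of the relevant fragment on one host (the adapter host name is a placeholder; the port shown is the adapter's commonly used default, which should be verified against the deployed version):

```yaml
# On each aggregating Prometheus host: ship samples to the local
# prometheus-postgresql-adapter, which writes to PostgreSQL + TimescaleDB
# only while it holds the leader lock.
remote_write:
  - url: 'http://adapter-a:9201/write'
remote_read:
  - url: 'http://adapter-a:9201/read'
```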

III. Summary

Both the small-scale high-availability scheme in Chapter I and the large-scale scheme in Chapter II rely mainly on the high-availability method described in Prometheus's official documentation, together with the federation mechanism and the remote read/write storage interface. The active/standby switchover tools Keepalived and prometheus-postgresql-adapter, as well as the remote database PostgreSQL + TimescaleDB, can be replaced with alternatives such as an Nginx proxy, the service-registration tool Consul, or the remote storage Thanos; we can test against actual needs and then decide which third-party tools to use.

From: WRF2020 jianshu.com/p/bccfc58bc