As an important part of the edge computing infrastructure, monitoring is the basic guarantee of edge stability. This article introduces the monitoring practice of Volcano Engine edge computing and shares how Volcano Engine chose its monitoring technology and built its monitoring service system. The main contents are as follows:

  1. The original intention of edge computing monitoring
  2. A monitoring system based on Prometheus
  3. Landing practice
  4. Summary

01 The Original Intention of Edge Computing Monitoring

As an important part of the edge computing infrastructure, monitoring is the basic guarantee of edge stability. Faced with the low latency, high bandwidth, and heterogeneous convergence that characterize the edge environment, how can the operating status of edge clusters be presented more clearly, and how can the complex and changeable online challenges of the edge be handled? To answer these questions, Volcano Engine edge computing needed to build a complete edge computing monitoring and service system.

02 Monitoring system based on Prometheus

Volcano Engine edge computing uses a cloud-native architecture, and Prometheus, as the metrics monitoring tool of the cloud-native era, has inherent advantages here. Compared with other monitoring schemes, Prometheus offers:

  1. Native support for Kubernetes (hereinafter K8s) monitoring, with service discovery for K8s objects; the core components expose Prometheus scrape endpoints;
  2. An HTTP-based pull model for collecting time-series data, which suits the multi-cluster monitoring requirements of the edge;
  3. No hard dependency on storage, supporting both local and remote storage modes;
  4. PromQL as the data query language, through which users can query aggregated data directly from Prometheus;
  5. Support for a wide variety of charting and dashboard tools, such as Grafana.
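As a sketch of point 4, a typical PromQL aggregation looks like the following (the metric name assumes the node-exporter defaults mentioned later in this article):

```promql
# Average CPU utilization per node over the last 5 minutes,
# computed as 100% minus the idle share of CPU time
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
```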

The architecture of a monitoring system based on Prometheus is shown in Figure 2. The data sources and the Prometheus Server are described in detail below.

Data sources

In the monitoring system, the data source part includes:

  • node-exporter: collects metrics of physical nodes;
  • kube-state-metrics: collects K8s-related metrics, including resource usage and status information of various objects;
  • cAdvisor: collects container-related metrics;
  • Monitoring data of core components such as apiserver, etcd, scheduler, K8s-LVM, and GPU;
  • Other custom metrics, which can be scraped automatically by adding prometheus.io/scrape: "true" to the annotations in a Pod's YAML file.
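As a sketch of the last bullet, the opt-in annotation on a Pod might look like the following (the Pod name, image, port, and path are illustrative, not from the original system):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app                   # hypothetical Pod name
  annotations:
    prometheus.io/scrape: "true"   # opt this Pod in to automatic scraping
    prometheus.io/port: "8080"     # illustrative metrics port
    prometheus.io/path: "/metrics" # conventional metrics path
spec:
  containers:
    - name: demo-app
      image: demo-app:latest       # hypothetical image
      ports:
        - containerPort: 8080
```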

Prometheus Server

Prometheus Server is the core module of Prometheus. It mainly covers three functions: scraping, storage, and querying:

  • Scraping: Prometheus Server periodically pulls monitoring metrics over HTTP from the exporters found through a service discovery component.
  • Storage: the scraped monitoring data is cleaned and organized by configured rules (relabel_configs, provided by service discovery, applies before the scrape; metric_relabel_configs, within a job, applies after the scrape), and the results are stored as new time series for persistence.
  • Querying: after Prometheus persists the data, clients can query it with PromQL statements.
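A minimal scrape job illustrating the two relabeling stages described above — relabel_configs before the scrape and metric_relabel_configs after it (the job name, label names, and regex are illustrative):

```yaml
scrape_configs:
  - job_name: edge-nodes              # illustrative job name
    kubernetes_sd_configs:
      - role: node                    # discover K8s nodes via service discovery
    relabel_configs:                  # applied BEFORE the scrape
      - source_labels: [__meta_kubernetes_node_label_region]
        target_label: region          # copy a discovered node label onto the series
    metric_relabel_configs:           # applied AFTER the scrape
      - source_labels: [__name__]
        regex: go_gc_.*               # drop noisy runtime series before storage
        action: drop
```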

03 Landing Practice

Overall Monitoring Architecture

Edge computing builds its monitoring architecture on Prometheus, M3DB, and the self-developed Metrix. The architecture diagram is as follows:

The whole edge computing monitoring architecture mainly consists of data collection, Prometheus, M3DB, Grafana, and Metrix.

  • Data collection

    • Prometheus usually monitors components through exporters. An exporter does not actively push monitoring data to the server; instead, it waits for the server to collect the data periodically — that is, a pull model. The exporters used by edge computing include node_exporter, XLb_exporter, and kubevirt-exporter.
    • The Endpoints object defines the device IP addresses and ports to be monitored, and the Prometheus Pod pulls metrics from each device according to the ServiceMonitor configuration.
  • Prometheus

    • The Prometheus Pod relabels and pre-aggregates the collected data according to recording rules and externalLabels (e.g. cluster: bdcdn-bccu), then remote-writes it to M3DB at the remote end.
  • M3DB

    • M3DB is a distributed time-series database that implements the Prometheus remote_read and remote_write interfaces and supports query languages such as PromQL. We use M3DB to store the monitoring data of edge computing, which feeds alerting and display.
  • Metrix and Grafana

    • Metrix and Grafana query data from M3DB via PromQL statements; Metrix serves as the alerting system and Grafana provides the visual display.
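The collection path described above can be sketched with a ServiceMonitor (a Prometheus Operator CRD; the names, labels, and interval below are illustrative, not taken from the original system):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: node-exporter        # illustrative name
  labels:
    release: prometheus      # must match the Prometheus instance's selector
spec:
  selector:
    matchLabels:
      app: node-exporter     # selects the Service whose Endpoints are scraped
  endpoints:
    - port: metrics          # named port on the Service
      interval: 30s          # scrape interval
```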

Monitoring components

Monitoring of all K8s components and of services is realized through several components, with Prometheus at the core, as follows:

| Monitoring component | Function |
| --- | --- |
| prometheus | An open-source combination of monitoring, alerting, and a time-series database |
| prometheus-adapter | Exposes Prometheus metrics through the Kubernetes metrics APIs |
| grafana | An open-source metrics analysis and visualization suite |
| cAdvisor | Monitors resources and containers on a machine and collects performance data in real time, including CPU usage, memory usage, network throughput, and file-system usage; now integrated into Kubelet |
| node-exporter | Collects hardware and system metrics on *NIX systems |
| eventrouter | Event collection; can ship events in the cluster to ES |
| blackbox-exporter | Actively probes host and service status |
| M3DB (storage) | Distributed time-series database |
| Metrix | Self-developed product, used for alerting |

Storage backend

The storage backend is mainly M3DB. M3DB is a cloud-native distributed time-series database that provides highly flexible, high-performance aggregation services, a query engine, and more. By splitting functionality into separate components, M3 allows the distributed time-series database to scale well.

  • Distributed time-series data store (M3DB): provides horizontally scalable time-series storage and reverse indexing;
  • A sidecar program (M3Coordinator) that lets M3DB serve as remote storage for Prometheus;
  • Distributed query engine (M3Query): supports PromQL, Graphite, and M3's own query syntax;
  • Aggregation service (M3Aggregator): aggregates and downsamples data to implement different storage policies for different metrics.
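Assuming M3Coordinator listens on its default port 7201, wiring Prometheus to M3DB as remote storage might look like the following prometheus.yml fragment (the hostname and cluster label are illustrative):

```yaml
# prometheus.yml fragment (host and cluster label are assumptions)
global:
  external_labels:
    cluster: edge-cluster-01   # illustrative cluster label added to all series
remote_write:
  - url: http://m3coordinator:7201/api/v1/prom/remote/write
remote_read:
  - url: http://m3coordinator:7201/api/v1/prom/remote/read
```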

Monitoring indicators

K8s resource metrics fall into two classes:

  • Resource metrics: built-in API provided by metrics-server;
  • Custom metrics: collected through Prometheus, which requires the k8s-prometheus-adapter.

(1) Metrics-server: API server for core resource metrics

kubectl api-versions does not include metrics.k8s.io/v1beta1 by default;

When using kube-aggregator, add the kube-aggregator prefix.

You can use kubectl top nodes to get this information.

(2) User-defined indicators

node_exporter is used to expose node information, alongside other exporters.

K8s cannot parse PromQL query statements directly; the k8s-prometheus-adapter component is required to convert metrics collected via Prometheus (such as those from kube-state-metrics) into the custom metrics API.
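A sketch of a k8s-prometheus-adapter rule that exposes a Prometheus series through the custom metrics API (the metric name and rate window are illustrative, not from the original system):

```yaml
# prometheus-adapter configuration fragment
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'  # illustrative metric
    resources:
      overrides:
        namespace: {resource: "namespace"}   # map labels to K8s resources
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"                  # exposed custom-metric name
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```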

(3) HPA — scaling based on resource value index

Specify a Deployment, ReplicaSet, or ReplicationController and create an autoscaler with defined resource targets. Autoscaling can then automatically increase or decrease the number of Pods deployed in the system as needed.

  • Metrics Server: collects cluster-level core resource metrics through the aggregator, gathering CPU and memory usage of nodes and Pods via each node's /stats/summary interface. The Summary API is a memory-efficient API for passing data from Kubelet/cAdvisor to the Metrics Server.
  • API Server: aggregates the core resource metrics provided by the Metrics Server and exposes the metrics.k8s.io/v1beta1 API to the HPA for automatic scaling.
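The scaling flow above can be sketched with an HPA manifest (the Deployment name, replica bounds, and CPU threshold are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-app-hpa           # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app             # hypothetical Deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70%
```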

At the same time, we use Grafana to query and visualize the collected data. By decomposing the system, monitoring is converged and displayed at the physical machine, network, K8s, storage, and other levels. The above are the main modules of the whole edge computing monitoring service system.

04 Summary

To review the main points of this article:

  • First, monitoring of edge computing scenarios was introduced, including the challenges and importance of monitoring at the edge.
  • Then, how to build an edge computing monitoring system based on Prometheus was described.
  • Finally, the landing practice of edge computing monitoring was presented, including the monitoring architecture, monitoring components, storage backend, and monitoring metrics.

The Prometheus-based monitoring system not only meets the monitoring requirements of diversified businesses, but also improves Volcano Engine edge computing's management and operation capabilities for edge clusters. Prometheus, as a new generation of open-source monitoring system, has become the de facto standard for cloud-native systems, and its design has proved its worth.