A GPU monitoring solution based on DCGM and Prometheus

Background: In our early GPU monitoring setup, we used NVML-based tools to collect basic information about the GPU cards and persist it to the storage layer of the monitoring system. Basic GPU information can also be obtained with commands such as nvidia-smi. However, as the AI market has developed and matured, GPU monitoring increasingly calls for a standardized tool chain, namely the DCGM-based monitoring solution described in this article.

Data Center GPU Manager (DCGM) is a set of tools for managing and monitoring Tesla™ GPUs in a cluster environment.

It includes active health monitoring, overall diagnostics, system alerts, and governance policies including power and clock management.

It can be used standalone by system administrators and is easily integrated into cluster management, resource scheduling, and monitoring products from NVIDIA partners.

DCGM simplifies GPU management in data centers, improves resource reliability and uptime, automates administration tasks, and helps improve overall infrastructure efficiency.

Note: Although the relevant information could be collected with the nvidia-smi command and reported periodically to a data store for analysis, computation, and presentation, integrating a complete monitoring system that way still requires a series of custom modifications. We therefore adopted NVIDIA's official DCGM solution for GPU data collection and integrated monitoring and alerting through Prometheus, which bills itself as the next-generation monitoring system.

DCGM tool deployment

$ git clone https://github.com/NVIDIA/gpu-monitoring-tools.git

dcgm-exporter is a GPU monitoring tool developed by NVIDIA for nvidia-docker 2.x.

It writes the basic metrics of each GPU card, in the Prometheus metrics format, to a file.

$ cd dcgm-exporter

Build the nvidia/dcgm-exporter:latest image:

$ make

Run the exporter container:

$ docker run -d --runtime=nvidia --rm --name=nvidia-dcgm-exporter nvidia/dcgm-exporter

Check the GPU metrics collected by dcgm-exporter

$ docker exec -it nvidia-dcgm-exporter tail -n 10 /run/prometheus/dcgm.prom
dcgm_ecc_dbe_aggregate_total{gpu="0",uuid="GPU-b91e30ac-fe77-e236-11ea-078bc2d1f226"} 0
# HELP dcgm_retired_pages_sbe Total number of retired pages due to single-bit errors.
# TYPE dcgm_retired_pages_sbe counter
dcgm_retired_pages_sbe{gpu="0",uuid="GPU-b91e30ac-fe77-e236-11ea-078bc2d1f226"} 0
# HELP dcgm_retired_pages_dbe Total number of retired pages due to double-bit errors.
# TYPE dcgm_retired_pages_dbe counter
dcgm_retired_pages_dbe{gpu="0",uuid="GPU-b91e30ac-fe77-e236-11ea-078bc2d1f226"} 0
# HELP dcgm_retired_pages_pending Total number of pages pending retirement.
# TYPE dcgm_retired_pages_pending counter
dcgm_retired_pages_pending{gpu="0",uuid="GPU-b91e30ac-fe77-e236-11ea-078bc2d1f226"} 0


Metrics collected by dcgm-exporter and their meanings:

| Metric | Meaning |
| --- | --- |
| dcgm_fan_speed_percent | GPU fan speed (%) |
| dcgm_sm_clock | GPU SM clock (MHz) |
| dcgm_memory_clock | GPU memory clock (MHz) |
| dcgm_gpu_temp | GPU operating temperature (°C) |
| dcgm_power_usage | GPU power usage (W) |
| dcgm_pcie_tx_throughput | Total bytes transmitted over GPU PCIe TX (KB) |
| dcgm_pcie_rx_throughput | Total bytes received over GPU PCIe RX (KB) |
| dcgm_pcie_replay_counter | Total number of GPU PCIe replays (retries) |
| dcgm_gpu_utilization | GPU utilization (%) |
| dcgm_mem_copy_utilization | GPU memory copy (bandwidth) utilization (%) |
| dcgm_enc_utilization | GPU encoder utilization (%) |
| dcgm_dec_utilization | GPU decoder utilization (%) |
| dcgm_xid_errors | Value of the last XID error encountered on the GPU |
| dcgm_power_violation | Throttling duration due to GPU power constraints (µs) |
| dcgm_thermal_violation | Throttling duration due to GPU thermal constraints (µs) |
| dcgm_sync_boost_violation | Throttling duration due to GPU sync-boost constraints (µs) |
| dcgm_fb_free | GPU framebuffer (FB) memory free (MiB) |
| dcgm_fb_used | GPU framebuffer (FB) memory used (MiB) |
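To show how these metrics can be used for alerting, here is a minimal sketch of a Prometheus alerting-rules file built on dcgm_gpu_temp and dcgm_xid_errors. The rule names and thresholds are placeholders, not official recommendations, so tune them for your own hardware:

groups:
- name: gpu-alerts
  rules:
  # Placeholder threshold; adjust for your cards and cooling.
  - alert: GpuHighTemperature
    expr: dcgm_gpu_temp > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "GPU {{ $labels.gpu }} on {{ $labels.instance }} is running at {{ $value }}°C"
  # dcgm_xid_errors reports the last XID error value; any non-zero value deserves a look.
  - alert: GpuXidError
    expr: dcgm_xid_errors > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "GPU {{ $labels.gpu }} on {{ $labels.instance }} reported XID error {{ $value }}"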

In fact, the DCGM toolset already collects the required GPU metrics according to Prometheus's data format and standards. At this point, we can write a simple API program tailored to our environment and expose the collected data over HTTP, so that the metrics of each GPU host can be scraped and monitored by a central Prometheus server.

The project officially provides such an API interface for the pod-based model of a Kubernetes cluster, developed in Go; read on for how to use it.

prometheus gpu metrics exporter

The gpu-monitoring-tools project provides a pod-gpu-metrics-exporter module by default for deploying GPU metrics collection in a Kubernetes cluster. The official example steps are as follows:

  • nvidia-k8s-device-plugin
  • Deploy GPU Pods

Note: To deploy in a Kubernetes cluster, your GPU nodes must already be managed by the K8S cluster; that is, the GPUs have been successfully registered in the cluster and GPU resources can be scheduled.
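As a quick check that GPU scheduling actually works, you can create a test pod that requests one GPU. This is only a minimal sketch; the image tag is an example and any CUDA base image matching your driver will do:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    # Example image; replace with a CUDA image matching your driver.
    image: nvidia/cuda:10.0-base
    command: ["nvidia-smi"]
    resources:
      limits:
        # One GPU, scheduled through the nvidia-k8s-device-plugin.
        nvidia.com/gpu: 1

If the pod completes and its log shows the nvidia-smi output, GPU scheduling is working.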

$ kubectl create namespace monitoring

Add gpu metrics endpoint to prometheus

$ kubectl create -f prometheus/prometheus-configmap.yaml

Deploy prometheus

$ kubectl create -f prometheus/prometheus-deployment.yaml

Deploy the pod-gpu-metrics-exporter DaemonSet

$ kubectl create -f pod-gpu-metrics-exporter-daemonset.yaml

Open in browser: localhost:9090

Building and running the Docker images directly

This is still done within the gpu-monitoring-tools project:

$ docker build -t pod-gpu-metrics-exporter .

Run dcgm-exporter

$ docker run -d --runtime=nvidia --rm --name=nvidia-dcgm-exporter nvidia/dcgm-exporter

Run gpu-metrics-exporter

$ docker run -d --privileged --rm -p 9400:9400 -v /var/lib/kubelet/pod-resources:/var/lib/kubelet/pod-resources --volumes-from nvidia-dcgm-exporter:ro nvidia/pod-gpu-metrics-exporter:v1.0.0-alpha

At this point, the data collected by dcgm-exporter is successfully exposed through an external interface:

$ curl -s localhost:9400/gpu/metrics


It is important to note that gpu-metrics-exporter collects GPU metrics on a per-pod basis, together with basic pod information.

Therefore, if your GPU host is not managed by a Kubernetes cluster, the official image cannot be used directly. You need to modify src/http.go so that it reads the metrics file exposed by dcgm-exporter (gpuMetrics) instead of the per-pod file (gpuPodMetrics); otherwise the API will fail to find the metrics file when it is accessed.

func getGPUmetrics(resp http.ResponseWriter, req *http.Request) {
    //metrics, err := ioutil.ReadFile(gpuPodMetrics)
    metrics, err := ioutil.ReadFile(gpuMetrics)
    if err != nil {
        http.Error(resp, err.Error(), http.StatusInternalServerError)
        glog.Errorf("error responding to %v%v: %v", req.Host, req.URL, err.Error())
        return
    }
    resp.Write(metrics)
}

Refer to gpu-metrics-exporter

To save time, you can pull the following two images and run them directly on a GPU host that is already set up:

  • dcgm-exporter: docker pull bgbiao/dcgm-exporter:latest
  • gpu-metrics-exporter: docker pull bgbiao/gpu-metrics-exporter:latest
$ docker run -d --runtime=nvidia --rm --name=nvidia-dcgm-exporter bgbiao/dcgm-exporter

$ docker run -d --privileged --rm -p 9400:9400 --volumes-from nvidia-dcgm-exporter:ro bgbiao/gpu-metrics-exporter
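If you would rather manage the two containers together, a docker-compose file is one option. The sketch below assumes the v2.3 compose file format (which supports runtime and volumes_from); it is not part of the official project:

version: "2.3"
services:
  dcgm-exporter:
    image: bgbiao/dcgm-exporter:latest
    container_name: nvidia-dcgm-exporter
    # Requires nvidia-docker2 on the host.
    runtime: nvidia
  gpu-metrics-exporter:
    image: bgbiao/gpu-metrics-exporter:latest
    privileged: true
    ports:
    - "9400:9400"
    # Share /run/prometheus (the dcgm.prom file) from the exporter container.
    volumes_from:
    - "dcgm-exporter:ro"
    depends_on:
    - dcgm-exporter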

Check the basic GPU information exposed by the interface

$ curl -s localhost:9400/gpu/metrics
dcgm_ecc_dbe_aggregate_total{gpu="0",uuid="GPU-b91e30ac-fe77-e236-11ea-078bc2d1f226"} 0
......


Prometheus data storage and Grafana data display

Note: With the gpu-metrics-exporter described above, our GPU runtime data is now accessible in a Prometheus-compatible format and can be configured on the Prometheus server to be pulled periodically.
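For reference, if your Prometheus server runs outside Kubernetes (or is not managed by the Prometheus Operator), a plain static scrape job is enough. A minimal sketch, using the host IP and port from the examples in this article:

scrape_configs:
- job_name: gpu-metrics
  scrape_interval: 30s
  metrics_path: /gpu/metrics
  static_configs:
  - targets: ['172.16.65.234:9400']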

Our Prometheus server, however, is currently deployed inside the Kubernetes cluster, so here we share how to pull the monitoring data of GPU hosts outside the cluster into the in-cluster Prometheus and display it with a unified Grafana.

Create an endpoint and the corresponding service

Gpu-metrics endpoint and service configuration

$ cat endpoint-gpus.yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: gpu-metrics
  namespace: monitoring
  labels:
    app: gpu-metrics
subsets:
- addresses:
  - ip: 172.16.65.234
  ports:
  - port: 9400
    name: http-metrics
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  namespace: monitoring
  name: gpu-metrics
  labels:
    app: gpu-metrics
spec:
  ports:
  - name: http-metrics
    port: 19400
    targetPort: 9400
    protocol: TCP

$ kubectl apply -f endpoint-gpus.yaml

View the created resources

$ kubectl get ep,svc -n monitoring -l app=gpu-metrics
NAME                    ENDPOINTS            AGE
endpoints/gpu-metrics   172.16.65.234:9400   5m24s

NAME                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
service/gpu-metrics   ClusterIP   10.253.138.97   <none>        19400/TCP   5m24s

Test the endpoint exposed by the Service

Ensure that the endpoint exposed by the Service is accessible within the cluster

$ curl 10.253.138.97:19400/gpu/metrics
# HELP dcgm_sm_clock SM clock frequency (in MHz).
# TYPE dcgm_sm_clock gauge
dcgm_sm_clock{gpu="0",uuid="GPU-b91e30ac-fe77-e236-11ea-078bc2d1f226"} 1328
# HELP dcgm_memory_clock Memory clock frequency (in MHz).
# TYPE dcgm_memory_clock gauge
dcgm_memory_clock{gpu="0",uuid="GPU-b91e30ac-fe77-e236-11ea-078bc2d1f226"} 715


Create a rule for Prometheus to fetch data

$ cat prometheus-gpus.yml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: gpu-metrics
  name: gpu-metrics
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    port: http-metrics
    path: /gpu/metrics
  jobLabel:
  namespaceSelector:
    matchNames:
    - monitoring
  selector:
    matchLabels:
      app: gpu-metrics

$ kubectl apply -f prometheus-gpus.yml
servicemonitor.monitoring.coreos.com/gpu-metrics created

$ kubectl get servicemonitor -n monitoring gpu-metrics
NAME          AGE
gpu-metrics   69s


After these resources are created, the corresponding target appears in the in-cluster Prometheus server. Once the target status is Up, Prometheus scrapes the GPU metrics of the hosts outside the cluster every 30 seconds, and the data flows continuously into Prometheus storage.

prometheus-gpu-targets
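Optionally, recording rules can pre-aggregate the raw metrics for dashboards. A minimal sketch; the rule names below are placeholders:

groups:
- name: gpu-recording
  rules:
  # Average GPU utilization per host.
  - record: instance:dcgm_gpu_utilization:avg
    expr: avg by (instance) (dcgm_gpu_utilization)
  # Fraction of framebuffer memory in use per host.
  - record: instance:dcgm_fb_used:ratio
    expr: sum by (instance) (dcgm_fb_used) / (sum by (instance) (dcgm_fb_used) + sum by (instance) (dcgm_fb_free))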

Grafana monitor display

At this point we have reached the last leg of the journey: displaying the GPU monitoring data stored in Prometheus through Grafana, so that basic GPU metrics can be analyzed in real time.

On the Grafana website, the community has already built GPU monitoring dashboard templates, such as [gpu-Nodes-Metrics](https://grafana.com/grafana/dashboards/12027). As users, all we need to do is import the template in the Grafana panel and use it.

  • Select the import mode
  • Specify the template (dashboard ID or JSON)
  • Note: make sure the Prometheus data source is correct before importing

Final GPU monitoring dashboard
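If you provision Grafana declaratively, the Prometheus data source that the dashboard depends on can also be defined in a provisioning file. A minimal sketch; the URL is a placeholder and should point at your own Prometheus service:

apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  access: proxy
  # Placeholder; use the address of your in-cluster Prometheus service.
  url: http://prometheus-k8s.monitoring.svc:9090
  isDefault: true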

Reference projects

gpu-monitoring-tools

gpu-metrics-grafana

