A GPU monitoring solution based on DCGM and Prometheus

Background: In our early GPU monitoring setup, we used NVML-based tools to collect basic information about the GPU cards and persist it to the storage layer of the monitoring system. Basic GPU information can also be obtained with commands such as nvidia-smi. However, as the AI market has developed and matured, GPU monitoring increasingly calls for a standardized tool chain, namely the DCGM-based monitoring solution described in this article.

Data Center GPU Manager (DCGM) is a set of tools for managing and monitoring Tesla™ GPUs in a cluster environment.

It includes active health monitoring, overall diagnostics, system alerts, and governance policies including power and clock management.

It can be used standalone by system administrators and is easily integrated into cluster management, resource scheduling, and monitoring products from NVIDIA partners.

DCGM simplifies GPU management in data centers, improves resource reliability and uptime, automates administration tasks, and helps improve overall infrastructure efficiency.

Note: Although the relevant information could be collected with the nvidia-smi command and reported periodically to a data store for analysis, computation, and presentation, integrating a complete monitoring system that way still requires a series of custom modifications. We therefore adopted NVIDIA's official DCGM solution for GPU data collection and integrated monitoring and alerting through Prometheus, which bills itself as the next-generation monitoring system.

DCGM tool deployment

$ git clone https://github.com/NVIDIA/gpu-monitoring-tools.git

dcgm-exporter is a GPU monitoring tool developed by NVIDIA for nvidia-docker 2.x.

It writes the basic metrics of each GPU card, in the Prometheus metrics format, to a file.

$ cd dcgm-exporter

Build the nvidia/dcgm-exporter:latest image:

$ make

Run the exporter container:

$ docker run -d --runtime=nvidia --rm --name=nvidia-dcgm-exporter nvidia/dcgm-exporter

Check the GPU metrics collected by dcgm-exporter

$ docker exec -it nvidia-dcgm-exporter tail -n 10 /run/prometheus/dcgm.prom
dcgm_ecc_dbe_aggregate_total{gpu="0",uuid="GPU-b91e30ac-fe77-e236-11ea-078bc2d1f226"} 0
# HELP dcgm_retired_pages_sbe Total number of retired pages due to single-bit errors.
# TYPE dcgm_retired_pages_sbe counter
dcgm_retired_pages_sbe{gpu="0",uuid="GPU-b91e30ac-fe77-e236-11ea-078bc2d1f226"} 0
# HELP dcgm_retired_pages_dbe Total number of retired pages due to double-bit errors.
# TYPE dcgm_retired_pages_dbe counter
dcgm_retired_pages_dbe{gpu="0",uuid="GPU-b91e30ac-fe77-e236-11ea-078bc2d1f226"} 0
# HELP dcgm_retired_pages_pending Total number of pages pending retirement.
# TYPE dcgm_retired_pages_pending counter
dcgm_retired_pages_pending{gpu="0",uuid="GPU-b91e30ac-fe77-e236-11ea-078bc2d1f226"} 0


Metrics collected by dcgm-exporter and their meanings:

| Metric | Meaning |
| --- | --- |
| dcgm_fan_speed_percent | GPU fan speed (%) |
| dcgm_sm_clock | GPU SM clock (MHz) |
| dcgm_memory_clock | GPU memory clock (MHz) |
| dcgm_gpu_temp | GPU operating temperature (°C) |
| dcgm_power_usage | GPU power usage (W) |
| dcgm_pcie_tx_throughput | Total bytes transmitted over GPU PCIe TX (KB) |
| dcgm_pcie_rx_throughput | Total bytes received over GPU PCIe RX (KB) |
| dcgm_pcie_replay_counter | Total number of GPU PCIe replays (retries) |
| dcgm_gpu_utilization | GPU utilization (%) |
| dcgm_mem_copy_utilization | GPU memory copy (bandwidth) utilization (%) |
| dcgm_enc_utilization | GPU encoder utilization (%) |
| dcgm_dec_utilization | GPU decoder utilization (%) |
| dcgm_xid_errors | Value of the last XID error encountered on the GPU |
| dcgm_power_violation | Throttling duration due to GPU power constraints (µs) |
| dcgm_thermal_violation | Throttling duration due to GPU thermal constraints (µs) |
| dcgm_sync_boost_violation | Throttling duration due to GPU sync-boost constraints (µs) |
| dcgm_fb_free | GPU framebuffer (FB) memory free (MiB) |
| dcgm_fb_used | GPU framebuffer (FB) memory used (MiB) |
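To show how these metrics can be used for alerting, here is a minimal sketch of a Prometheus alerting-rules file built on dcgm_gpu_temp and dcgm_xid_errors. The rule names and thresholds are placeholders, not official recommendations, so tune them for your own hardware:

groups:
- name: gpu-alerts
  rules:
  # Placeholder threshold; adjust for your cards and cooling.
  - alert: GpuHighTemperature
    expr: dcgm_gpu_temp > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "GPU {{ $labels.gpu }} on {{ $labels.instance }} is running at {{ $value }}°C"
  # dcgm_xid_errors reports the last XID error value; any non-zero value deserves a look.
  - alert: GpuXidError
    expr: dcgm_xid_errors > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "GPU {{ $labels.gpu }} on {{ $labels.instance }} reported XID error {{ $value }}"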

In fact, the DCGM toolset already collects the required GPU metrics according to Prometheus's data format and standards. At this point, we can write a simple API program tailored to our environment and expose the collected data over HTTP, so that the metrics of each GPU host can be scraped and monitored by a central Prometheus server.

The project officially provides such an API interface for the pod-based model of a Kubernetes cluster, developed in Go; read on for how to use it.

prometheus gpu metrics exporter

The gpu-monitoring-tools project provides a pod-gpu-metrics-exporter module by default for deploying GPU metrics collection in a Kubernetes cluster. The official example steps are as follows:

  • nvidia-k8s-device-plugin
  • Deploy GPU Pods

Note: To deploy in a Kubernetes cluster, your GPU nodes must already be managed by the K8S cluster; that is, the GPUs have been successfully registered in the cluster and GPU resources can be scheduled.
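As a quick check that GPU scheduling actually works, you can create a test pod that requests one GPU. This is only a minimal sketch; the image tag is an example and any CUDA base image matching your driver will do:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    # Example image; replace with a CUDA image matching your driver.
    image: nvidia/cuda:10.0-base
    command: ["nvidia-smi"]
    resources:
      limits:
        # One GPU, scheduled through the nvidia-k8s-device-plugin.
        nvidia.com/gpu: 1

If the pod completes and its log shows the nvidia-smi output, GPU scheduling is working.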

$ kubectl create namespace monitoring

Add gpu metrics endpoint to prometheus

$ kubectl create -f prometheus/prometheus-configmap.yaml

Deploy prometheus

$ kubectl create -f prometheus/prometheus-deployment.yaml

Deploy the pod-gpu-metrics-exporter DaemonSet

$ kubectl create -f pod-gpu-metrics-exporter-daemonset.yaml

Open in browser: localhost:9090

Building and running the Docker images directly

This is still done within the gpu-monitoring-tools project:

$ docker build -t pod-gpu-metrics-exporter .

Run dcgm-exporter

$ docker run -d --runtime=nvidia --rm --name=nvidia-dcgm-exporter nvidia/dcgm-exporter

Run gpu-metrics-exporter

$ docker run -d --privileged --rm -p 9400:9400 -v /var/lib/kubelet/pod-resources:/var/lib/kubelet/pod-resources --volumes-from nvidia-dcgm-exporter:ro nvidia/pod-gpu-metrics-exporter:v1.0.0-alpha

At this point, the data collected by dcgm-exporter is successfully exposed through an external interface:

$ curl -s localhost:9400/gpu/metrics


It is important to note that gpu-metrics-exporter collects GPU metrics on a per-pod basis, together with basic pod information.

Therefore, if your GPU host is not managed by a Kubernetes cluster, the official image cannot be used directly. You need to modify src/http.go so that it reads the metrics file exposed by dcgm-exporter (gpuMetrics) instead of the per-pod file (gpuPodMetrics); otherwise the API will fail to find the metrics file when it is accessed.

func getGPUmetrics(resp http.ResponseWriter, req *http.Request) {
    //metrics, err := ioutil.ReadFile(gpuPodMetrics)
    metrics, err := ioutil.ReadFile(gpuMetrics)
    if err != nil {
        http.Error(resp, err.Error(), http.StatusInternalServerError)
        glog.Errorf("error responding to %v%v: %v", req.Host, req.URL, err.Error())
        return
    }
    resp.Write(metrics)
}

Refer to gpu-metrics-exporter

To save time, you can pull the following two images and run them directly on a GPU host that is already set up:

  • dcgm-exporter: docker pull bgbiao/dcgm-exporter:latest
  • gpu-metrics-exporter: docker pull bgbiao/gpu-metrics-exporter:latest
$ docker run -d --runtime=nvidia --rm --name=nvidia-dcgm-exporter bgbiao/dcgm-exporter

$ docker run -d --privileged --rm -p 9400:9400 --volumes-from nvidia-dcgm-exporter:ro bgbiao/gpu-metrics-exporter
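If you would rather manage the two containers together, a docker-compose file is one option. The sketch below assumes the v2.3 compose file format (which supports runtime and volumes_from); it is not part of the official project:

version: "2.3"
services:
  dcgm-exporter:
    image: bgbiao/dcgm-exporter:latest
    container_name: nvidia-dcgm-exporter
    # Requires nvidia-docker2 on the host.
    runtime: nvidia
  gpu-metrics-exporter:
    image: bgbiao/gpu-metrics-exporter:latest
    privileged: true
    ports:
    - "9400:9400"
    # Share /run/prometheus (the dcgm.prom file) from the exporter container.
    volumes_from:
    - "dcgm-exporter:ro"
    depends_on:
    - dcgm-exporter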

Check the basic GPU information exposed by the interface

$ curl -s localhost:9400/gpu/metrics
dcgm_ecc_dbe_aggregate_total{gpu="0",uuid="GPU-b91e30ac-fe77-e236-11ea-078bc2d1f226"} 0
......


Prometheus data storage and Grafana data display

Note: With the gpu-metrics-exporter described above, our GPU runtime data is now accessible in a Prometheus-compatible format and can be configured on the Prometheus server to be pulled periodically.
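For reference, if your Prometheus server runs outside Kubernetes (or is not managed by the Prometheus Operator), a plain static scrape job is enough. A minimal sketch, using the host IP and port from the examples in this article:

scrape_configs:
- job_name: gpu-metrics
  scrape_interval: 30s
  metrics_path: /gpu/metrics
  static_configs:
  - targets: ['172.16.65.234:9400']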

Our Prometheus server, however, is currently deployed inside the Kubernetes cluster, so here we share how to pull the monitoring data of GPU hosts outside the cluster into the in-cluster Prometheus and display it with a unified Grafana.

Create an endpoint and the corresponding service

Gpu-metrics endpoint and service configuration

$ cat endpoint-gpus.yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: gpu-metrics
  namespace: monitoring
  labels:
    app: gpu-metrics
subsets:
- addresses:
  - ip: 172.16.65.234
  ports:
  - port: 9400
    name: http-metrics
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  namespace: monitoring
  name: gpu-metrics
  labels:
    app: gpu-metrics
spec:
  ports:
  - name: http-metrics
    port: 19400
    targetPort: 9400
    protocol: TCP

$ kubectl apply -f endpoint-gpus.yaml

View the created resources

$ kubectl get ep,svc -n monitoring -l app=gpu-metrics
NAME                    ENDPOINTS            AGE
endpoints/gpu-metrics   172.16.65.234:9400   5m24s

NAME                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
service/gpu-metrics   ClusterIP   10.253.138.97   <none>        19400/TCP   5m24s

Test the endpoint exposed by the Service

Ensure that the endpoint exposed by the Service is accessible within the cluster

$ curl 10.253.138.97:19400/gpu/metrics
# HELP dcgm_sm_clock SM clock frequency (in MHz).
# TYPE dcgm_sm_clock gauge
dcgm_sm_clock{gpu="0",uuid="GPU-b91e30ac-fe77-e236-11ea-078bc2d1f226"} 1328
# HELP dcgm_memory_clock Memory clock frequency (in MHz).
# TYPE dcgm_memory_clock gauge
dcgm_memory_clock{gpu="0",uuid="GPU-b91e30ac-fe77-e236-11ea-078bc2d1f226"} 715


Create a rule for Prometheus to fetch data

$ cat prometheus-gpus.yml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: gpu-metrics
  name: gpu-metrics
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    port: http-metrics
    path: /gpu/metrics
  jobLabel:
  namespaceSelector:
    matchNames:
    - monitoring
  selector:
    matchLabels:
      app: gpu-metrics

$ kubectl apply -f prometheus-gpus.yml
servicemonitor.monitoring.coreos.com/gpu-metrics created

$ kubectl get servicemonitor -n monitoring gpu-metrics
NAME          AGE
gpu-metrics   69s


After these resources are created, the corresponding target appears in the in-cluster Prometheus server. Once the target status is Up, Prometheus scrapes the GPU metrics of the hosts outside the cluster every 30 seconds, and the data flows continuously into Prometheus storage.

prometheus-gpu-targets
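Optionally, recording rules can pre-aggregate the raw metrics for dashboards. A minimal sketch; the rule names below are placeholders:

groups:
- name: gpu-recording
  rules:
  # Average GPU utilization per host.
  - record: instance:dcgm_gpu_utilization:avg
    expr: avg by (instance) (dcgm_gpu_utilization)
  # Fraction of framebuffer memory in use per host.
  - record: instance:dcgm_fb_used:ratio
    expr: sum by (instance) (dcgm_fb_used) / (sum by (instance) (dcgm_fb_used) + sum by (instance) (dcgm_fb_free))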

Grafana monitor display

At this point we have reached the last leg of the journey: displaying the GPU monitoring data stored in Prometheus through Grafana, so that basic GPU metrics can be analyzed in real time.

On the Grafana website, the community has already built GPU monitoring dashboard templates, such as [gpu-Nodes-Metrics](https://grafana.com/grafana/dashboards/12027). As users, all we need to do is import the template in the Grafana panel and use it.

  • Select the import mode
  • Specify the template (dashboard ID or JSON)
  • Note: make sure the Prometheus data source is correct before importing

Final GPU monitoring dashboard
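If you provision Grafana declaratively, the Prometheus data source that the dashboard depends on can also be defined in a provisioning file. A minimal sketch; the URL is a placeholder and should point at your own Prometheus service:

apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  access: proxy
  # Placeholder; use the address of your in-cluster Prometheus service.
  url: http://prometheus-k8s.monitoring.svc:9090
  isDefault: true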

Reference projects

gpu-monitoring-tools

gpu-metrics-grafana

