Prometheus Exporter: Nvidia GPU Usage per Docker Container

Tracking GPU Consumption in Docker Containers with Prometheus

TJ. Podobnik, @dorkamotorka
Level Up Coding


The advent of AI has introduced new challenges to the constantly evolving cloud computing sector. AI models typically rely on one or more GPUs, both for training and for day-to-day application use. Given the significant advances in hardware in recent years, dedicating an entire GPU to a single application container would be wasteful; instead, GPUs are often shared across multiple containers. Yet every Prometheus exporter I’ve encountered only reports overall GPU consumption, not usage per container. In this post, we’ll explore a custom Prometheus exporter designed to address this specific need.

⚠️ Note: This post assumes some general understanding of containers, GPUs, and Prometheus.

First off, what’s already there?

In my search for robust and reliable monitoring tools, preferably developed by the original equipment manufacturers, I encountered a few promising options like NVML and DCGM exporters for NVIDIA GPUs. However, none of these tools offer the specific container-level metrics I was after. This gap in functionality is somewhat understandable, considering that from a GPU’s perspective, it’s engaging with a process, not discerning whether that process is isolated within a container. This realization led me to accept that a tailor-made solution was necessary. Although I won’t delve into why GPUs are pivotal for application containers — a usage that’s become increasingly common — I was nonetheless surprised by the lack of ready-made solutions readily discoverable through a simple Google search.

Custom Prometheus Exporter

Initially, it was crucial to ascertain whether monitoring GPU usage on a per-container basis was feasible at all. Technically, the prospect seemed viable: process IDs can be used to identify which processes are utilizing the GPU and whether those processes are running inside containers. A quick and rudimentary solution would have been to script calls to nvidia-smi and the Docker CLI, but I aimed for a more refined approach and opted to work directly with the SDKs and APIs for a cleaner integration. This can prove challenging, given the often sparse documentation and the need to dig through source code to uncover the necessary details.

GPU side of the solution

After some investigation, I discovered the NVIDIA Management Library (NVML), a well-structured resource that made it straightforward to understand how it works. I would speculate that NVML serves as the backbone of nvidia-smi, although since nvidia-smi is not open source I can’t confirm it. NVML offers bindings for Python, Go, and C, but notably, the Go binding is the only one directly maintained by the NVIDIA team, which made it an obvious choice for this project. The Go binding is available in the NVIDIA/go-nvml repository on GitHub for those interested in exploring further.
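
To get a feel for the binding before diving into the full exporter, here is a minimal sketch (my own illustration, not an official NVIDIA sample) that uses the same go-nvml calls to print the PID and GPU memory of every compute process on each device:

package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	// NVML must be initialized before any other call
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("Unable to initialize NVML: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	// Number of GPUs visible to the driver
	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("Unable to get device count: %v", nvml.ErrorString(ret))
	}

	for i := 0; i < count; i++ {
		device, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			log.Fatalf("Unable to get device at index %d: %v", i, nvml.ErrorString(ret))
		}

		// Each entry holds the PID and the GPU memory that process currently uses
		procs, ret := device.GetComputeRunningProcesses()
		if ret != nvml.SUCCESS {
			log.Fatalf("Unable to list processes on device %d: %v", i, nvml.ErrorString(ret))
		}
		for _, p := range procs {
			fmt.Printf("GPU %d: PID %d uses %d bytes\n", i, p.Pid, p.UsedGpuMemory)
		}
	}
}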

Containers side of the solution

Initially, I intended to leverage the Containerd API for container-side interaction, as both Docker and Kubernetes rely on it as a default runtime, offering the potential to address multiple challenges simultaneously. However, this approach proved unfeasible based on my findings. The issue stemmed from the fact that when a container is spawned using Docker, its name (not its ID) is solely stored within Docker, making it inaccessible through tools like nerdctl or ctr, which interact directly with Containerd. Moreover, the container name isn’t retrievable through Linux processes under /proc. Despite this setback, I remain committed to enhancing flexibility by offering users a configuration file option, enabling them to specify the environment in which the Prometheus exporter operates. Further exploration of this aspect will be covered in a subsequent installment.
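
To illustrate the Docker side in isolation, here is a minimal sketch (assuming the official Docker Go SDK, the same client the exporter below uses) that resolves each running container to the host PID of its main process, which is the PID we later match against NVML’s process list:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func main() {
	// Talk to the local Docker daemon using the environment's settings (DOCKER_HOST, etc.)
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		log.Fatalf("Failed to create Docker client: %v", err)
	}
	defer cli.Close()

	// Only running containers can hold a GPU process
	containers, err := cli.ContainerList(context.Background(), container.ListOptions{})
	if err != nil {
		log.Fatalf("Failed to list containers: %v", err)
	}

	for _, c := range containers {
		// Inspect exposes the PID of the container's main process on the host
		info, err := cli.ContainerInspect(context.Background(), c.ID)
		if err != nil {
			log.Printf("Failed to inspect container %s: %v", c.ID, err)
			continue
		}
		// Note: Docker prefixes the container name with a leading slash
		fmt.Printf("%s (%s) -> PID %d\n", info.Name, c.ID[:12], info.State.Pid)
	}
}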

Below, we’ll look at the Docker-specific implementation. By avoiding reliance on the docker and nvidia-smi CLIs, we sidestep the need to parse complex command output, ensuring cleaner, more maintainable code. This approach underscores the importance of clarity and efficiency in software development, paving the way for streamlined monitoring of GPU usage within Docker containers.

⚠️ Note: Please read the code comments, as I won’t be discussing the code in detail afterwards.

package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"strings"
	"time"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	containerGpuMemoryUsed = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "docker_gpu_memory_usage",
			Help: "GPU memory used by the Docker container, in bytes",
		},
		[]string{"pid", "container_id", "container"},
	)
	containerGpuMemoryPercUsed = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "docker_gpu_memory_perc_usage",
			Help: "GPU memory used by the Docker container, as a percentage of total device memory",
		},
		[]string{"pid", "container_id", "container"},
	)
)

func main() {
	// Register the Prometheus metrics on a dedicated registry
	reg := prometheus.NewRegistry()
	reg.MustRegister(containerGpuMemoryUsed)
	reg.MustRegister(containerGpuMemoryPercUsed)

	// Initialize NVML
	ret := nvml.Init()
	if ret != nvml.SUCCESS {
		log.Fatalf("Unable to initialize NVML: %v", nvml.ErrorString(ret))
	}
	defer func() {
		ret := nvml.Shutdown()
		if ret != nvml.SUCCESS {
			log.Fatalf("Unable to shutdown NVML: %v", nvml.ErrorString(ret))
		}
	}()

	// Create a Docker client from the environment (DOCKER_HOST, etc.)
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		log.Fatalf("Failed to create Docker client: %v", err)
	}
	defer cli.Close()

	// Start the Prometheus metrics server in the background
	go func() {
		handler := promhttp.HandlerFor(reg, promhttp.HandlerOpts{})
		http.Handle("/metrics", handler)
		log.Fatal(http.ListenAndServe(":8000", nil))
	}()

	for {
		// List running containers
		containers, err := cli.ContainerList(context.Background(), container.ListOptions{})
		if err != nil {
			log.Fatalf("Failed to list containers: %v", err)
		}

		// Get device count
		count, ret := nvml.DeviceGetCount()
		if ret != nvml.SUCCESS {
			log.Fatalf("Unable to get device count: %v", nvml.ErrorString(ret))
		}

		// Iterate over devices
		for di := 0; di < count; di++ {
			device, ret := nvml.DeviceGetHandleByIndex(di)
			if ret != nvml.SUCCESS {
				log.Fatalf("Unable to get device at index %d: %v", di, nvml.ErrorString(ret))
			}

			// Get device memory info (total, free, used)
			memoryInfo, ret := device.GetMemoryInfo()
			if ret != nvml.SUCCESS {
				log.Fatalf("Unable to get device memory at index %d: %v", di, nvml.ErrorString(ret))
			}

			// Get compute processes currently running on the device
			processInfos, ret := device.GetComputeRunningProcesses()
			if ret != nvml.SUCCESS {
				log.Fatalf("Unable to get process info for device at index %d: %v", di, nvml.ErrorString(ret))
			}

			// Iterate over running processes
			for _, processInfo := range processInfos {
				// Iterate over containers
				for _, ctr := range containers {
					// Inspect each container to get detailed information
					containerInfo, err := cli.ContainerInspect(context.Background(), ctr.ID)
					if err != nil {
						log.Printf("Failed to inspect container %s: %v", ctr.ID, err)
						continue
					}

					// Extract the container's main PID, short ID and name
					// (Docker prefixes the name with a leading slash)
					pid := containerInfo.State.Pid
					containerID := ctr.ID[:12]
					containerName := strings.TrimPrefix(containerInfo.Name, "/")

					// The GPU only knows about PIDs, so match the process on
					// the device against the container's main PID
					if pid == int(processInfo.Pid) {
						// Set Prometheus metrics
						containerGpuMemoryUsed.WithLabelValues(fmt.Sprintf("%d", pid), containerID, containerName).Set(float64(processInfo.UsedGpuMemory))
						percent := (float64(processInfo.UsedGpuMemory) / float64(memoryInfo.Total)) * 100
						containerGpuMemoryPercUsed.WithLabelValues(fmt.Sprintf("%d", pid), containerID, containerName).Set(percent)
					}
				}
			}
		}
		time.Sleep(30 * time.Second)
	}
}
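
Once built and running, the exporter serves its metrics on port 8000 for Prometheus to scrape. Purely as an illustration (the container name, ID, PID, and values below are made up), the exposed series look roughly like this:

# HELP docker_gpu_memory_usage GPU memory used by the Docker container, in bytes
# TYPE docker_gpu_memory_usage gauge
docker_gpu_memory_usage{container="my-training-job",container_id="ab12cd34ef56",pid="4242"} 2.147483648e+09
# HELP docker_gpu_memory_perc_usage GPU memory used by the Docker container, as a percentage of total device memory
# TYPE docker_gpu_memory_perc_usage gauge
docker_gpu_memory_perc_usage{container="my-training-job",container_id="ab12cd34ef56",pid="4242"} 12.5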

Conclusion

In conclusion, tackling the challenge of monitoring GPU usage per Docker container required a blend of innovation and technical prowess. By leveraging the capabilities of the NVIDIA Management Library (NVML) and navigating the complexities of container APIs, this custom Prometheus exporter stands as a testament to the evolving needs in cloud computing and AI application management. It underscores the importance of precise resource monitoring in a shared hardware environment, paving the way for more efficient and effective use of GPU resources across multiple containers. This exploration not only fills a crucial gap in current monitoring tools but also sets the stage for further advancements in containerized application performance analytics.

To stay current with the latest cloud technologies, make sure to subscribe to my weekly newsletter, Cloud Chirp. 🚀
