Feature Request: GPU Support #833
Status: Open
Labels: kind/feature, lifecycle/frozen
This feature request aims to enhance the Node Problem Detector with the ability to monitor GPUs on nodes and detect issues.
Currently NPD has no direct visibility into GPUs, yet many workloads are GPU-accelerated, which makes GPU health an important part of node health. For example, GPUs are widely used in machine learning training and inference, especially LLM training, which may use tens of thousands of GPU cards. If any single GPU in the cluster goes bad, the entire training job must be restarted from the previous checkpoint.
This feature request adds the following capabilities:
- GPU device monitoring: NPD will collect GPU device info periodically and look for crashes or errors via nvidia-smi/nvml/dcgm tools.
- GPU device monitoring: NPD will check GPU device info periodically to detect if a GPU is "stuck" (e.g. the nvidia-smi command hangs).
- TBD: GPU runtime monitoring: NPD will check for crashes or OOM issues reported in nvidia logs.
Specifically, this feature request includes:
Looking forward to your feedback!