Open
Labels: topic:compatibility (Compatibility-related work)
Description
Problem Statement
GPU access currently relies on the legacy nvidia-container-runtime +
nvidia-container-cli stack at two layers: once when Docker injects GPUs
into the k3s cluster container, and again when the nvidia-device-plugin +
nvidia-container-runtime inject them into individual sandbox pods.
Proposed Design
Both layers should be migrated to CDI instead. The general idea:

- Generate a CDI spec on the host before starting the cluster: `nvidia-ctk cdi generate`
- Use Docker's native CDI support (available since Docker 25) to pass GPUs into the k3s container: `--device nvidia.com/gpu=all`
- Mount `/etc/cdi` into the k3s container, enable `enable_cdi_devices = true` in the containerd config, and configure the nvidia-device-plugin to use CDI device IDs so containerd handles injection natively
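As a rough sketch, the host-side half of this could look like the commands below. The image name `openshell/k3s` is a placeholder, and the spec path is illustrative; the `nvidia-ctk cdi generate` command and `--device nvidia.com/gpu=all` syntax are the documented NVIDIA and Docker 25+ interfaces.

```shell
# Generate a CDI spec describing the host's GPUs before the cluster starts.
# /etc/cdi is one of Docker's default CDI spec directories.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Start the k3s cluster container, passing all GPUs via Docker's native CDI
# support (Docker >= 25) and mounting the spec directory read-only so the
# in-cluster containerd can do its own CDI injection for sandbox pods.
# "openshell/k3s" is a hypothetical image name for illustration.
docker run -d \
  --device nvidia.com/gpu=all \
  -v /etc/cdi:/etc/cdi:ro \
  openshell/k3s
```

These commands require a GPU host with the NVIDIA Container Toolkit and a running Docker daemon, so they are shown as a recipe rather than something runnable in isolation.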
CDI is the canonical way NVIDIA supports GPU access in containerized
environments going forward. Some platforms require CDI and are incompatible
with the legacy runtime stack, so this would also broaden the set of platforms
OpenShell can run on. It also makes what gets injected explicit and
auditable via the CDI spec rather than delegating to a CLI with broad host
access.
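To illustrate the auditability point: a generated CDI spec is plain YAML that enumerates exactly which device nodes and mounts get injected. A trimmed, hypothetical example (device names, library paths, and the exact edits vary per host and driver version):

```yaml
cdiVersion: "0.6.0"
kind: nvidia.com/gpu
devices:
  - name: "0"
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia0
containerEdits:
  deviceNodes:
    - path: /dev/nvidiactl
  mounts:
    - hostPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
      containerPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
```

Every injection is visible in this file, whereas the legacy `nvidia-container-cli` path performs equivalent edits opaquely at container-create time.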
Alternatives Considered
None
Agent Investigation
No response
Checklist
- I've reviewed existing issues and the architecture docs
- This is a design proposal, not a "please build this" request