
feat(bootstrap,cli): switch GPU injection to CDI where supported #495

Open
elezar wants to merge 5 commits into main from feat/cdi-in-cluster

Conversation

@elezar
Member

@elezar elezar commented Mar 20, 2026

Summary

Switch GPU device injection in cluster bootstrap to use CDI (the Container Device Interface) when it is enabled in Docker, i.e. when the docker info endpoint returns a non-empty list of CDI spec directories. When this is not the case, the existing --gpus all NVIDIA DeviceRequest path is used as a fallback. The --gpu flag on gateway start is extended to let users control the injection mode: pass explicit CDI device names, or force the legacy --gpus all path.

Related Issue

Part of #398

Changes

  • feat(bootstrap): Auto-select CDI (driver="cdi", device_ids=["nvidia.com/gpu=all"]) if CDI is enabled; fall back to legacy driver="nvidia" on older daemons or when version is unknown
  • feat(cli): --gpu now accepts an optional value: omit for auto-select, --gpu=legacy to force legacy, or --gpu=<cdi-device> for an explicit CDI device name (e.g. nvidia.com/gpu=all, nvidia.com/gpu=0)
  • feat(cli): --device added as an alias for --gpu
  • Input validation rejects mixing legacy/auto with explicit CDI names or specifying them more than once
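
The auto-selection behavior described above can be sketched as follows. This is a simplified sketch, not the implementation from the diff: the real resolve_gpu_device_ids takes `docker_version: Option<&str>`, while here CDI support is collapsed into a boolean for illustration.

```rust
/// Simplified sketch of the resolution table documented in the diff.
fn resolve_gpu_device_ids(gpu: &[String], cdi_supported: bool) -> Vec<String> {
    match gpu {
        // No --gpu flag: no GPU injection.
        [] => Vec::new(),
        // Explicit opt-out of CDI: pass the sentinel through.
        [v] if v == "legacy" => vec!["legacy".to_string()],
        // Auto-select: CDI device name when supported, else legacy.
        [v] if v == "auto" => {
            if cdi_supported {
                vec!["nvidia.com/gpu=all".to_string()]
            } else {
                vec!["legacy".to_string()]
            }
        }
        // Explicit CDI device names are passed through unchanged.
        ids => ids.to_vec(),
    }
}
```

Keeping the resolution in one function means the CLI layer only ever deals in strings, and the deploy path decides at the last moment which DeviceRequest shape to emit.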

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

Comment on lines +37 to +43
/// | Input | Output |
/// |--------------|--------------------------------------------------------------|
/// | `[]` | `[]` — no GPU |
/// | `["legacy"]` | `["legacy"]` — pass through |
/// | `["auto"]` | `["nvidia.com/gpu=all"]` if CDI supported, else `["legacy"]` |
/// | `[cdi-ids…]` | unchanged |
pub(crate) fn resolve_gpu_device_ids(gpu: &[String], docker_version: Option<&str>) -> Vec<String> {

It feels weird to me to overload this flag with legacy and auto if it is meant to be the list of device_ids in the end.

Would it make more sense to add a new flag for the mode that accepts auto, legacy, or cdi, and then have --gpus (or your new --devices alias) accept the CDI devices if/only if it's in cdi mode (or auto mode choosing cdi)?

Is the concern backwards compatibility with the existing semantics of the --gpu boolean flag?

Member Author


Yes, the reason I wanted to extend the existing flag is to maintain backward compatibility. I was initially going to add a separate --device flag to mirror what we have done in other runtimes, but this would require more user engagement.

It should also be noted that --gpu is equivalent to --gpu="auto", so the UX does not change. I also only added legacy as an option to let users explicitly opt out in cases where CDI injection is not doing what it should. In the medium term, I would expect legacy to be removed entirely (or to just map to nvidia.com/gpu=all).

@elezar elezar force-pushed the feat/cdi-in-cluster branch from b1e6015 to dd2682c Compare March 20, 2026 08:06
@github-actions

github-actions bot commented Mar 20, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/OpenShell/pr-preview/pr-495/

Built to branch gh-pages at 2026-03-20 09:57 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@elezar elezar force-pushed the feat/cdi-in-cluster branch from dd2682c to aa0c7bb Compare March 20, 2026 08:37
elezar added 5 commits March 20, 2026 10:46
Use an explicit CDI device request (driver="cdi", device_ids=["nvidia.com/gpu=all"])
when the Docker daemon reports CDI spec directories via GET /info (SystemInfo.CDISpecDirs).
This makes device injection declarative and decouples spec generation from consumption.

When the daemon reports no CDI spec directories, fall back to the legacy NVIDIA device
request (driver="nvidia", count=-1) which relies on the NVIDIA Container Runtime hook.
Failure modes for both paths are equivalent: a missing or stale NVIDIA Container Toolkit
installation will cause container start to fail.

CDI spec generation is out of scope for this change; specs are expected to be
pre-generated out-of-band, for example by the NVIDIA Container Toolkit.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
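
The two request shapes this commit describes can be sketched with a stand-in struct. The DeviceRequest type here is an illustrative assumption (field names follow Docker's Engine API), not the type used in the diff; the driver/count/device_ids values are those stated in the commit message.

```rust
/// Simplified stand-in for Docker's DeviceRequest (illustrative only).
#[derive(Debug, PartialEq)]
struct DeviceRequest {
    driver: String,
    count: i64,
    device_ids: Vec<String>,
    capabilities: Vec<Vec<String>>,
}

/// Pick the CDI request when the daemon reports CDI spec directories
/// (SystemInfo.CDISpecDirs); otherwise fall back to the legacy NVIDIA
/// request, where count=-1 means "all GPUs" (equivalent to --gpus all).
fn gpu_device_request(cdi_spec_dirs: &[String], cdi_device_ids: &[String]) -> DeviceRequest {
    if !cdi_spec_dirs.is_empty() {
        DeviceRequest {
            driver: "cdi".to_string(),
            count: 0,
            device_ids: cdi_device_ids.to_vec(),
            capabilities: Vec::new(),
        }
    } else {
        DeviceRequest {
            driver: "nvidia".to_string(),
            count: -1,
            device_ids: Vec::new(),
            capabilities: vec![vec!["gpu".to_string()]],
        }
    }
}
```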
The --gpu flag on `gateway start` now accepts an optional value:

  --gpu           Auto-select: CDI on Docker >= 28.2.0, legacy otherwise
  --gpu=legacy    Force the legacy nvidia DeviceRequest (driver="nvidia")

Internally, the gpu bool parameter to ensure_container is replaced with
a device_ids slice. resolve_gpu_device_ids resolves the "auto" sentinel
to a concrete device ID list based on the Docker daemon version, keeping
the resolution logic in one place at deploy time.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Explicit CDI device IDs can now be passed:

  --gpu=nvidia.com/gpu=all        single CDI device
  --gpu=nvidia.com/gpu=0 --gpu=nvidia.com/gpu=1  multiple CDI devices

parse_gpu_flag validates the input and rejects mixing legacy/auto with
CDI device names or specifying them more than once.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
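
The validation rules this commit describes can be sketched as below. This is an assumed shape for illustration; the real parse_gpu_flag lives in the diff and may differ in detail.

```rust
use std::collections::HashSet;

/// Sketch of the two validation rules described above: sentinels
/// (legacy/auto) may not be mixed with explicit CDI device names,
/// and no value may be specified more than once.
fn parse_gpu_flag(values: &[String]) -> Result<Vec<String>, String> {
    let is_sentinel = |v: &str| v == "legacy" || v == "auto";
    let has_sentinel = values.iter().any(|v| is_sentinel(v));
    let has_cdi_name = values.iter().any(|v| !is_sentinel(v));
    if has_sentinel && has_cdi_name {
        return Err("cannot mix legacy/auto with explicit CDI device names".to_string());
    }
    let mut seen = HashSet::new();
    for v in values {
        if !seen.insert(v.as_str()) {
            return Err(format!("{v} specified more than once"));
        }
    }
    Ok(values.to_vec())
}
```

For example, `--gpu=auto --gpu=nvidia.com/gpu=0` would be rejected, while `--gpu=nvidia.com/gpu=0 --gpu=nvidia.com/gpu=1` passes through.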
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the feat/cdi-in-cluster branch from 808270c to f304997 Compare March 20, 2026 09:56
@elezar elezar marked this pull request as ready for review March 20, 2026 10:04