
feat(bootstrap,cli): switch GPU injection to CDI where supported #495

Open
elezar wants to merge 5 commits into main from feat/cdi-in-cluster

Conversation

@elezar
Member

@elezar elezar commented Mar 20, 2026

Summary

Switch GPU device injection in cluster bootstrap to use CDI (the Container Device Interface) when it is enabled in Docker, i.e. when the docker info endpoint returns a non-empty list of CDI spec directories. When this is not the case, the existing --gpus all NVIDIA DeviceRequest path is used as a fallback. The --gpu flag on gateway start is extended to let users control the injection mode: pass explicit CDI device names, or force the legacy --gpus all path.

Related Issue

Part of #398

Changes

  • feat(bootstrap): Auto-select CDI (driver="cdi", device_ids=["nvidia.com/gpu=all"]) if CDI is enabled; fall back to legacy driver="nvidia" on older daemons or when version is unknown
  • feat(cli): --gpu now accepts an optional value: omit for auto-select, --gpu=legacy to force legacy, or --gpu=<cdi-device> for an explicit CDI device name (e.g. nvidia.com/gpu=all, nvidia.com/gpu=0)
  • feat(cli): --device added as an alias for --gpu
  • Input validation rejects mixing legacy/auto with explicit CDI names or specifying them more than once
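
The auto-selection behavior described above can be sketched as follows. This is a simplified sketch, not the implementation from the diff: the real resolve_gpu_device_ids takes `docker_version: Option<&str>`, while here CDI support is collapsed into a boolean for illustration.

```rust
/// Simplified sketch of the resolution table documented in the diff.
fn resolve_gpu_device_ids(gpu: &[String], cdi_supported: bool) -> Vec<String> {
    match gpu {
        // No --gpu flag: no GPU injection.
        [] => Vec::new(),
        // Explicit opt-out of CDI: pass the sentinel through.
        [v] if v == "legacy" => vec!["legacy".to_string()],
        // Auto-select: CDI device name when supported, else legacy.
        [v] if v == "auto" => {
            if cdi_supported {
                vec!["nvidia.com/gpu=all".to_string()]
            } else {
                vec!["legacy".to_string()]
            }
        }
        // Explicit CDI device names are passed through unchanged.
        ids => ids.to_vec(),
    }
}
```

Keeping the resolution in one function means the CLI layer only ever deals in strings, and the deploy path decides at the last moment which DeviceRequest shape to emit.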

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

Comment on lines +37 to +43
/// | Input | Output |
/// |--------------|--------------------------------------------------------------|
/// | `[]` | `[]` — no GPU |
/// | `["legacy"]` | `["legacy"]` — pass through |
/// | `["auto"]` | `["nvidia.com/gpu=all"]` if CDI supported, else `["legacy"]` |
/// | `[cdi-ids…]` | unchanged |
pub(crate) fn resolve_gpu_device_ids(gpu: &[String], docker_version: Option<&str>) -> Vec<String> {

It feels weird to me to overload this flag with legacy and auto if it is meant to be the list of device_ids in the end.

Would it make more sense to add a new flag for the mode that accepts auto, legacy, or cdi, and then have --gpus (or your new --devices alias) accept the CDI devices if/only if it's in cdi mode (or auto mode choosing cdi)?

Is the concern backwards compatibility with the existing semantics of the --gpu boolean flag?

Member Author


Yes, the reason I wanted to extend the existing flag is to maintain backward compatibility. I was initially going to add a separate --device flag to mirror what we have done in other runtimes, but this would require more user engagement.

It should also be noted that --gpu is equivalent to --gpu="auto", so the UX does not change. I also only added legacy as an option to let users explicitly opt out in cases where CDI injection is not doing what it should. In the medium term, I would expect legacy to be removed entirely (or to just map to nvidia.com/gpu=all).

@elezar elezar force-pushed the feat/cdi-in-cluster branch from b1e6015 to dd2682c Compare March 20, 2026 08:06
@github-actions

github-actions bot commented Mar 20, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/OpenShell/pr-preview/pr-495/

Built to branch gh-pages at 2026-03-20 09:57 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@elezar elezar force-pushed the feat/cdi-in-cluster branch from dd2682c to aa0c7bb Compare March 20, 2026 08:37
elezar added 5 commits March 20, 2026 10:46
Use an explicit CDI device request (driver="cdi", device_ids=["nvidia.com/gpu=all"])
when the Docker daemon reports CDI spec directories via GET /info (SystemInfo.CDISpecDirs).
This makes device injection declarative and decouples spec generation from consumption.

When the daemon reports no CDI spec directories, fall back to the legacy NVIDIA device
request (driver="nvidia", count=-1) which relies on the NVIDIA Container Runtime hook.
Failure modes for both paths are equivalent: a missing or stale NVIDIA Container Toolkit
installation will cause container start to fail.

CDI spec generation is out of scope for this change; specs are expected to be
pre-generated out-of-band, for example by the NVIDIA Container Toolkit.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
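
The two request shapes this commit describes can be sketched with a stand-in struct. The DeviceRequest type here is an illustrative assumption (field names follow Docker's Engine API), not the type used in the diff; the driver/count/device_ids values are those stated in the commit message.

```rust
/// Simplified stand-in for Docker's DeviceRequest (illustrative only).
#[derive(Debug, PartialEq)]
struct DeviceRequest {
    driver: String,
    count: i64,
    device_ids: Vec<String>,
    capabilities: Vec<Vec<String>>,
}

/// Pick the CDI request when the daemon reports CDI spec directories
/// (SystemInfo.CDISpecDirs); otherwise fall back to the legacy NVIDIA
/// request, where count=-1 means "all GPUs" (equivalent to --gpus all).
fn gpu_device_request(cdi_spec_dirs: &[String], cdi_device_ids: &[String]) -> DeviceRequest {
    if !cdi_spec_dirs.is_empty() {
        DeviceRequest {
            driver: "cdi".to_string(),
            count: 0,
            device_ids: cdi_device_ids.to_vec(),
            capabilities: Vec::new(),
        }
    } else {
        DeviceRequest {
            driver: "nvidia".to_string(),
            count: -1,
            device_ids: Vec::new(),
            capabilities: vec![vec!["gpu".to_string()]],
        }
    }
}
```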
The --gpu flag on `gateway start` now accepts an optional value:

  --gpu           Auto-select: CDI on Docker >= 28.2.0, legacy otherwise
  --gpu=legacy    Force the legacy nvidia DeviceRequest (driver="nvidia")

Internally, the gpu bool parameter to ensure_container is replaced with
a device_ids slice. resolve_gpu_device_ids resolves the "auto" sentinel
to a concrete device ID list based on the Docker daemon version, keeping
the resolution logic in one place at deploy time.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Explicit CDI device IDs can now be passed:

  --gpu=nvidia.com/gpu=all        single CDI device
  --gpu=nvidia.com/gpu=0 --gpu=nvidia.com/gpu=1  multiple CDI devices

parse_gpu_flag validates the input and rejects mixing legacy/auto with
CDI device names or specifying them more than once.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
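
The validation rules this commit describes can be sketched as below. This is an assumed shape for illustration; the real parse_gpu_flag lives in the diff and may differ in detail.

```rust
use std::collections::HashSet;

/// Sketch of the two validation rules described above: sentinels
/// (legacy/auto) may not be mixed with explicit CDI device names,
/// and no value may be specified more than once.
fn parse_gpu_flag(values: &[String]) -> Result<Vec<String>, String> {
    let is_sentinel = |v: &str| v == "legacy" || v == "auto";
    let has_sentinel = values.iter().any(|v| is_sentinel(v));
    let has_cdi_name = values.iter().any(|v| !is_sentinel(v));
    if has_sentinel && has_cdi_name {
        return Err("cannot mix legacy/auto with explicit CDI device names".to_string());
    }
    let mut seen = HashSet::new();
    for v in values {
        if !seen.insert(v.as_str()) {
            return Err(format!("{v} specified more than once"));
        }
    }
    Ok(values.to_vec())
}
```

For example, `--gpu=auto --gpu=nvidia.com/gpu=0` would be rejected, while `--gpu=nvidia.com/gpu=0 --gpu=nvidia.com/gpu=1` passes through.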
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the feat/cdi-in-cluster branch from 808270c to f304997 Compare March 20, 2026 09:56
@elezar elezar marked this pull request as ready for review March 20, 2026 10:04