fix(gpu): add WSL2 GPU support via CDI mode #441
Conversation
WSL2 virtualises GPU access through /dev/dxg instead of native /dev/nvidia*
device nodes, which breaks the entire NVIDIA k8s device plugin detection
chain. Three changes fix this:
1. Detect WSL2 in cluster-entrypoint.sh and configure CDI mode:
- Generate CDI spec with nvidia-ctk (auto-detects WSL mode)
- Patch the spec to include libdxcore.so (nvidia-ctk bug omits it)
- Switch nvidia-container-runtime from auto to cdi mode
- Deploy a job to label the node with pci-10de.present=true
(NFD can't see NVIDIA PCI on WSL2's virtualised bus)
2. Bundle the nvidia-device-plugin Helm chart in the cluster image
instead of fetching from the upstream GitHub Pages repo at startup.
The repo URL (nvidia.github.io/k8s-device-plugin/index.yaml)
currently returns 404.
3. Update the HelmChart CR to reference the bundled local chart
tarball via the k3s static charts API endpoint.
Closes NVIDIA#404
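A minimal sketch of the detection and mode switch described in step 1. The `nvidia-ctk cdi generate` and `nvidia-ctk config --set` invocations are real toolkit commands, but the guard logic, paths, and NFD label key shown here are assumptions modelled on the commit message, not the PR's verbatim entrypoint code:

```shell
#!/bin/sh
# Hypothetical sketch of the WSL2 branch in cluster-entrypoint.sh.

is_wsl2() {
    # WSL2 kernels identify themselves in /proc/version and expose the
    # DirectX paravirtual device node /dev/dxg instead of /dev/nvidia*.
    grep -qi microsoft /proc/version 2>/dev/null && [ -e /dev/dxg ]
}

if is_wsl2; then
    echo "WSL2 detected: configuring NVIDIA runtime for CDI mode"
    # Generate the CDI spec; nvidia-ctk auto-detects WSL mode here.
    nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
    # Switch nvidia-container-runtime from the default "auto" to "cdi".
    nvidia-ctk config --in-place --set nvidia-container-runtime.mode=cdi
    # NFD can't see the NVIDIA PCI vendor on WSL2's virtualised bus, so the
    # PR deploys a job that applies the label; roughly equivalent to:
    #   kubectl label node "$(hostname)" feature.node.kubernetes.io/pci-10de.present=true
else
    echo "no WSL2 environment detected: leaving runtime mode as auto"
fi
```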
The upstream Helm repo URL works fine; remove the unnecessary chart bundling and local reference changes.
WSL2 virtualises GPU access through /dev/dxg instead of native /dev/nvidia* device nodes, which breaks the entire NVIDIA k8s device plugin detection chain. This patch detects WSL2 at container startup and applies fixes:
1. Generate CDI spec with nvidia-ctk (auto-detects WSL mode)
2. Add per-GPU UUID and index device entries to the CDI spec (nvidia-ctk only generates name=all, but the device plugin assigns GPUs by UUID)
3. Bump CDI spec version from 0.3.0 to 0.5.0 (library minimum)
4. Patch the spec to include libdxcore.so (an nvidia-ctk bug omits it; this library bridges Linux NVML to the Windows DirectX GPU kernel)
5. Switch nvidia-container-runtime from auto to cdi mode
6. Deploy a job to label the node with pci-10de.present=true (NFD can't see NVIDIA PCI on WSL2's virtualised bus)

Closes NVIDIA#404
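The per-GPU entry generation in step 2 could be sketched as a small shell helper. The `nvidia-smi --query-gpu` flags are real, but the exact CDI YAML layout emitted below is an assumption based on the commit's description (each entry keyed once by UUID and once by index, mapping to /dev/dxg):

```shell
# Hypothetical helper: reads "index, uuid" lines on stdin and emits one CDI
# device entry per GPU, keyed by UUID and by index, both exposing /dev/dxg.
emit_gpu_entries() {
    while IFS=', ' read -r idx uuid; do
        for name in "$uuid" "$idx"; do
            cat <<EOF
- name: "$name"
  containerEdits:
    deviceNodes:
    - path: /dev/dxg
EOF
        done
    done
}

# In the real script the input would come from:
#   nvidia-smi --query-gpu=index,uuid --format=csv,noheader
printf '0, GPU-11111111-2222-3333-4444-555555555555\n' | emit_gpu_entries
```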
The previous approach used sed to inject GPU UUID entries and libdxcore.so mounts into the nvidia-ctk-generated CDI spec. This corrupted the YAML structure (duplicate containerEdits keys), causing CDI device resolution to fail with "failed to unmarshal CDI Spec".

Replace it with writing the complete CDI spec from scratch using a heredoc. This is more robust and easier to understand. The spec includes:
- /dev/dxg device node
- Per-GPU entries by UUID and index (for device plugin allocation)
- libdxcore.so mount (missing from nvidia-ctk output on WSL2)
- All WSL driver store library mounts
- ldcache update hooks for both the driver store and libdxcore directories

Tested end-to-end: nemoclaw onboard -> gateway start -> WSL2 fix -> sandbox create with GPU -> nvidia-smi working inside the sandbox pod.
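The heredoc approach might look roughly like this. The mount paths (/usr/lib/wsl/lib/libdxcore.so, /usr/lib/wsl/drivers) are standard WSL2 locations and the hook shape follows what nvidia-ctk normally generates, but the spec body below is a sketch, not the PR's actual file:

```shell
# Hypothetical sketch: write a complete CDI spec in one heredoc instead of
# sed-patching nvidia-ctk output (which duplicated containerEdits keys).
spec=/tmp/nvidia-cdi-sketch.yaml   # the real script would target /etc/cdi/nvidia.yaml

cat > "$spec" <<'EOF'
cdiVersion: "0.5.0"
kind: nvidia.com/gpu
devices:
- name: all
  containerEdits:
    deviceNodes:
    - path: /dev/dxg
containerEdits:
  mounts:
  - hostPath: /usr/lib/wsl/lib/libdxcore.so    # bridges Linux NVML to the DX GPU kernel
    containerPath: /usr/lib/wsl/lib/libdxcore.so
    options: [ro, nosuid, nodev, bind]
  - hostPath: /usr/lib/wsl/drivers             # WSL driver store libraries
    containerPath: /usr/lib/wsl/drivers
    options: [ro, nosuid, nodev, bind]
  hooks:
  - hookName: createContainer
    path: /usr/bin/nvidia-ctk
    args: [nvidia-ctk, hook, update-ldcache, --folder, /usr/lib/wsl/lib]
EOF
```

Per-GPU UUID/index entries would be appended under `devices:` after enumerating GPUs; writing the whole file in one pass is what avoids the duplicate-key corruption.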
All contributors have signed the DCO ✍️ ✅

I have read the DCO document and I hereby sign the DCO.

recheck
Thanks @tyeth. Would it be possible for you to do a fresh smoke test, since you mentioned the LLM did this after you already did some manual tuning?
💯, but not for the next couple of days; my yak-shaving missions got out of hand and now work must be the priority 😊
Would it be better to implement a holistic approach to using CDI, as outlined in #398, rather than building something one-off for WSL2?
I would definitely be supportive of a more holistic approach, especially if we have some NVIDIANs willing to do it!
100% agree. This was just going to be lost in the ether if I didn't PR it, and it may unblock some GPU-hungry souls in the meantime. More of a prod-the-bear PR to elicit an official solution.
This is a newly created issue to effectively re-open #411, now from my vouched user account.
It's probably a less-than-ideal technique and would need cleanup before real use; it's also only partially tested, as my LLM created this after I'd manually patched containers etc.
Tested on a Frame.work 16" with AMD AI 7 350 CPU (+GPU), w/ NVIDIA RTX 5070 (8GB VRAM) + 96GB DDR5 unified/shared RAM.
See the original issue in NVIDIA/NemoClaw#208 (comment)