
fix(gpu): add WSL2 GPU support via CDI mode#441

Open
tyeth wants to merge 4 commits into NVIDIA:main from tyeth:fix-wsl2-gpu-support

Conversation


@tyeth tyeth commented Mar 18, 2026

This is a newly created PR to effectively re-open #411 from my vouched account.
The technique is probably less than ideal and would need cleanup before merging. It is also only partially tested: my LLM generated these changes after I had already patched the containers manually.

Tested on a Framework 16" laptop with an AMD AI 7 350 CPU (with integrated GPU), an NVIDIA RTX 5070 (8GB VRAM), and 96GB of DDR5 unified/shared RAM.

See the original issue at NVIDIA/NemoClaw#208 (comment)

Summary

  • Detect WSL2 at gateway startup (/dev/dxg present) and automatically configure CDI-based GPU injection
  • Fixes the complete nvidia-device-plugin failure chain on WSL2: NFD can't see the PCI bus, NVML can't initialise without libdxcore.so, and the CDI spec is missing per-GPU UUID entries
  • All changes are in cluster-entrypoint.sh — no Rust, Dockerfile, or manifest changes needed

What it does

When GPU_ENABLED=true and /dev/dxg exists (WSL2), the entrypoint:

  1. Generates CDI spec via nvidia-ctk cdi generate (auto-detects WSL mode)
  2. Adds per-GPU UUID and index device entries (nvidia-ctk only generates name=all, but the device plugin assigns GPUs by UUID)
  3. Bumps CDI spec version from 0.3.0 to 0.5.0 (library minimum)
  4. Patches the spec to include libdxcore.so (upstream nvidia-ctk bug: "nvidia-ctk cdi generate: libdxcore.so not found on WSL2 despite being present", nvidia-container-toolkit#1739)
  5. Switches nvidia-container-runtime from auto to cdi mode
  6. Deploys a k3s Job to label the node with pci-10de.present=true (NFD can't detect NVIDIA PCI on WSL2's virtualised bus)
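The detection and configuration flow above can be sketched roughly as follows. This is a minimal illustration, not the PR's exact code: function names are made up, and steps 2, 4, and 6 (per-GPU entries, the libdxcore.so patch, and the node-label Job) are elided.

```shell
#!/bin/sh
# Sketch of the WSL2 GPU setup path. Function names are illustrative;
# the real entrypoint also injects per-GPU CDI entries, patches in
# libdxcore.so, and deploys the node-label Job.
set -eu

# WSL2 exposes the GPU via /dev/dxg instead of /dev/nvidia* nodes.
# $1 lets a test point at a fake device path; defaults to /dev/dxg.
is_wsl2() {
    [ -e "${1:-/dev/dxg}" ]
}

configure_cdi_for_wsl2() {
    # Step 1: generate the CDI spec (nvidia-ctk auto-detects WSL mode).
    nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

    # Step 3: bump the spec version to the library minimum.
    sed -i 's/^cdiVersion:.*$/cdiVersion: "0.5.0"/' /etc/cdi/nvidia.yaml

    # Step 5: switch the runtime from auto to cdi mode.
    nvidia-ctk config --in-place --set nvidia-container-runtime.mode=cdi
}

if [ "${GPU_ENABLED:-false}" = "true" ] && is_wsl2; then
    configure_cdi_for_wsl2
fi
```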

On non-WSL2 hosts, the new code path is never entered (/dev/dxg doesn't exist).

Testing

Verified on:

  • Hardware: Framework 16 laptop, AMD CPU, NVIDIA RTX 5070 (8GB VRAM) + 96GB DDR5 shared
  • OS: WSL2 (Linux 6.6.87.2-microsoft-standard-WSL2)
  • Driver: NVIDIA 595.71, CUDA 13.2
  • Result: nvidia-device-plugin 1/1 Running, nvidia.com/gpu: 1 advertised, nvidia-smi works inside sandbox pods, full NemoClaw onboard + sandbox creation + local inference (ollama nemotron 70B) working end-to-end
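For anyone reproducing the smoke test, the final check can be scripted roughly as below. The pod name and CUDA image tag are illustrative, not taken from the PR:

```shell
# Emit a one-shot pod that requests one GPU and runs nvidia-smi.
# Pod name and image tag are placeholders.
emit_gpu_check_pod() {
    cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-check
spec:
  restartPolicy: Never
  containers:
  - name: smi
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
}

# Usage (against a live cluster):
#   kubectl -n kube-system get pods | grep nvidia-device-plugin
#   emit_gpu_check_pod | kubectl apply -f -
#   kubectl logs -f gpu-check
```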

Related

Agent Investigation

Diagnosed using openshell doctor commands. Full diagnostic chain documented in #404.

🤖 Generated with Claude Code

tyeth added 4 commits March 17, 2026 20:09
…chart

WSL2 virtualises GPU access through /dev/dxg instead of native /dev/nvidia*
device nodes, which breaks the entire NVIDIA k8s device plugin detection
chain. Three changes fix this:

1. Detect WSL2 in cluster-entrypoint.sh and configure CDI mode:
   - Generate CDI spec with nvidia-ctk (auto-detects WSL mode)
   - Patch the spec to include libdxcore.so (nvidia-ctk bug omits it)
   - Switch nvidia-container-runtime from auto to cdi mode
   - Deploy a job to label the node with pci-10de.present=true
     (NFD can't see NVIDIA PCI on WSL2's virtualised bus)

2. Bundle the nvidia-device-plugin Helm chart in the cluster image
   instead of fetching from the upstream GitHub Pages repo at startup.
   The repo URL (nvidia.github.io/k8s-device-plugin/index.yaml)
   currently returns 404.

3. Update the HelmChart CR to reference the bundled local chart
   tarball via the k3s static charts API endpoint.

Closes NVIDIA#404
The upstream Helm repo URL works fine; remove the unnecessary chart
bundling and local reference changes.
WSL2 virtualises GPU access through /dev/dxg instead of native /dev/nvidia*
device nodes, which breaks the entire NVIDIA k8s device plugin detection
chain. This patch detects WSL2 at container startup and applies fixes:

1. Generate CDI spec with nvidia-ctk (auto-detects WSL mode)
2. Add per-GPU UUID and index device entries to CDI spec (nvidia-ctk
   only generates name=all but the device plugin assigns GPUs by UUID)
3. Bump CDI spec version from 0.3.0 to 0.5.0 (library minimum)
4. Patch the spec to include libdxcore.so (nvidia-ctk bug omits it;
   this library bridges Linux NVML to the Windows DirectX GPU Kernel)
5. Switch nvidia-container-runtime from auto to cdi mode
6. Deploy a job to label the node with pci-10de.present=true
   (NFD can't detect NVIDIA PCI devices on WSL2's virtualised bus)
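Step 6 can be sketched as a one-shot Job along these lines. The full NFD-style label key, service account name, and image tag here are assumptions for illustration, not the PR's exact manifest:

```shell
# Emit a one-shot Job that applies the NFD-style PCI vendor label
# (10de = NVIDIA) that the device plugin's node selector expects.
# Label key, service account, and image tag are assumptions.
emit_label_job() {
    node="$1"
    cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: wsl2-gpu-node-label
  namespace: kube-system
spec:
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: node-labeler
      containers:
      - name: label
        image: bitnami/kubectl:1.29
        command: ["kubectl", "label", "node", "$node",
                  "feature.node.kubernetes.io/pci-10de.present=true",
                  "--overwrite"]
EOF
}

# Usage: emit_label_job "$(hostname)" | kubectl apply -f -
```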

Closes NVIDIA#404
The previous approach used sed to inject GPU UUID entries and libdxcore.so
mounts into the nvidia-ctk-generated CDI spec. This corrupted the YAML
structure (duplicate containerEdits keys) causing CDI device resolution to
fail with "failed to unmarshal CDI Spec".

Replace with writing the complete CDI spec from scratch using a heredoc.
This is more robust and easier to understand. The spec includes:
- /dev/dxg device node
- Per-GPU entries by UUID and index (for device plugin allocation)
- libdxcore.so mount (missing from nvidia-ctk on WSL2)
- All WSL driver store library mounts
- ldcache update hooks for both driver store and libdxcore directories
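A condensed sketch of the heredoc approach is below. The UUID handling, library paths, and hook details are placeholders (the real spec mounts every WSL driver-store library); the YAML anchor is just one way to keep the per-GPU entries in sync with the `all` entry.

```shell
# Write a complete CDI spec from scratch instead of sed-patching the
# nvidia-ctk output. Paths and hooks are condensed placeholders.
write_cdi_spec() {
    out="$1"; gpu_uuid="$2"
    cat > "$out" <<EOF
cdiVersion: "0.5.0"
kind: nvidia.com/gpu
devices:
- name: all
  containerEdits: &edits
    deviceNodes:
    - path: /dev/dxg
    mounts:
    - hostPath: /usr/lib/wsl/lib/libdxcore.so
      containerPath: /usr/lib/wsl/lib/libdxcore.so
      options: [ro, nosuid, nodev, bind]
- name: "$gpu_uuid"
  containerEdits: *edits
- name: "0"
  containerEdits: *edits
containerEdits:
  hooks:
  - hookName: createContainer
    path: /usr/bin/nvidia-ctk
    args: [nvidia-ctk, hook, update-ldcache, --folder, /usr/lib/wsl/lib]
EOF
}

# Usage: write_cdi_spec /etc/cdi/nvidia.yaml "GPU-<uuid-from-nvidia-smi>"
```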

Tested end-to-end: nemoclaw onboard -> gateway start -> WSL2 fix ->
sandbox create with GPU -> nvidia-smi working inside sandbox pod.

github-actions bot commented Mar 18, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.


tyeth commented Mar 18, 2026

I have read the DCO document and I hereby sign the DCO.


tyeth commented Mar 18, 2026

recheck

@tyeth tyeth marked this pull request as ready for review March 18, 2026 15:08
@tyeth tyeth requested a review from a team as a code owner March 18, 2026 15:08
@johntmyers (Collaborator) commented

Thanks @tyeth. Would it be possible for you to do a fresh smoke test since you mentioned the LLM did this after you already did some manual tuning?


tyeth commented Mar 18, 2026

💯, but not for the next couple of days, my yak shaving missions got out of hand and now work must be the priority 😊


klueska commented Mar 18, 2026

Would it be better to implement a holistic approach to using CDI, as outlined in #398, rather than building something one-off for WSL2?

@johntmyers (Collaborator) commented

I would definitely be supportive of a more holistic approach, especially if we have some NVIDIANs willing to do it!


tyeth commented Mar 18, 2026

Would it be better to implement a holistic approach to using CDI, as outlined in #398, rather than building something one-off for WSL2?

100% agree. This was just going to be lost in the ether if I didn't PR it, and it may unblock some GPU-hungry souls in the meantime. It's more of a prod-the-bear PR to elicit an official solution.



Development

Successfully merging this pull request may close these issues.

bug: GPU passthrough fails on WSL2 — NVML init fails without CDI mode and libdxcore.so

3 participants