fix(gpu): add WSL2 GPU support via CDI mode #441
Conversation
WSL2 virtualises GPU access through /dev/dxg instead of native /dev/nvidia*
device nodes, which breaks the entire NVIDIA k8s device plugin detection
chain. Three changes fix this:
1. Detect WSL2 in cluster-entrypoint.sh and configure CDI mode:
- Generate CDI spec with nvidia-ctk (auto-detects WSL mode)
- Patch the spec to include libdxcore.so (nvidia-ctk bug omits it)
- Switch nvidia-container-runtime from auto to cdi mode
- Deploy a job to label the node with pci-10de.present=true
(NFD can't see NVIDIA PCI on WSL2's virtualised bus)
2. Bundle the nvidia-device-plugin Helm chart in the cluster image
instead of fetching from the upstream GitHub Pages repo at startup.
The repo URL (nvidia.github.io/k8s-device-plugin/index.yaml)
currently returns 404.
3. Update the HelmChart CR to reference the bundled local chart
tarball via the k3s static charts API endpoint.
Closes NVIDIA#404
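A minimal sketch of the detection and mode switch described in step 1. The `nvidia-ctk cdi generate` and `nvidia-ctk config --set` invocations are real toolkit commands, but the guard logic, paths, and NFD label key shown here are assumptions modelled on the commit message, not the PR's verbatim entrypoint code:

```shell
#!/bin/sh
# Hypothetical sketch of the WSL2 branch in cluster-entrypoint.sh.

is_wsl2() {
    # WSL2 kernels identify themselves in /proc/version and expose the
    # DirectX paravirtual device node /dev/dxg instead of /dev/nvidia*.
    grep -qi microsoft /proc/version 2>/dev/null && [ -e /dev/dxg ]
}

if is_wsl2; then
    echo "WSL2 detected: configuring NVIDIA runtime for CDI mode"
    # Generate the CDI spec; nvidia-ctk auto-detects WSL mode here.
    nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
    # Switch nvidia-container-runtime from the default "auto" to "cdi".
    nvidia-ctk config --in-place --set nvidia-container-runtime.mode=cdi
    # NFD can't see the NVIDIA PCI vendor on WSL2's virtualised bus, so the
    # PR deploys a job that applies the label; roughly equivalent to:
    #   kubectl label node "$(hostname)" feature.node.kubernetes.io/pci-10de.present=true
else
    echo "no WSL2 environment detected: leaving runtime mode as auto"
fi
```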
The upstream Helm repo URL works fine; remove the unnecessary chart bundling and local reference changes.
WSL2 virtualises GPU access through /dev/dxg instead of native /dev/nvidia* device nodes, which breaks the entire NVIDIA k8s device plugin detection chain. This patch detects WSL2 at container startup and applies fixes:
1. Generate CDI spec with nvidia-ctk (auto-detects WSL mode)
2. Add per-GPU UUID and index device entries to the CDI spec (nvidia-ctk only generates name=all, but the device plugin assigns GPUs by UUID)
3. Bump CDI spec version from 0.3.0 to 0.5.0 (library minimum)
4. Patch the spec to include libdxcore.so (an nvidia-ctk bug omits it; this library bridges Linux NVML to the Windows DirectX GPU kernel)
5. Switch nvidia-container-runtime from auto to cdi mode
6. Deploy a job to label the node with pci-10de.present=true (NFD can't see NVIDIA PCI on WSL2's virtualised bus)

Closes NVIDIA#404
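The per-GPU entry generation in step 2 could be sketched as a small shell helper. The `nvidia-smi --query-gpu` flags are real, but the exact CDI YAML layout emitted below is an assumption based on the commit's description (each entry keyed once by UUID and once by index, mapping to /dev/dxg):

```shell
# Hypothetical helper: reads "index, uuid" lines on stdin and emits one CDI
# device entry per GPU, keyed by UUID and by index, both exposing /dev/dxg.
emit_gpu_entries() {
    while IFS=', ' read -r idx uuid; do
        for name in "$uuid" "$idx"; do
            cat <<EOF
- name: "$name"
  containerEdits:
    deviceNodes:
    - path: /dev/dxg
EOF
        done
    done
}

# In the real script the input would come from:
#   nvidia-smi --query-gpu=index,uuid --format=csv,noheader
printf '0, GPU-11111111-2222-3333-4444-555555555555\n' | emit_gpu_entries
```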
The previous approach used sed to inject GPU UUID entries and libdxcore.so mounts into the nvidia-ctk-generated CDI spec. This corrupted the YAML structure (duplicate containerEdits keys), causing CDI device resolution to fail with "failed to unmarshal CDI Spec".

Replace it with writing the complete CDI spec from scratch using a heredoc. This is more robust and easier to understand. The spec includes:
- /dev/dxg device node
- Per-GPU entries by UUID and index (for device plugin allocation)
- libdxcore.so mount (missing from nvidia-ctk output on WSL2)
- All WSL driver store library mounts
- ldcache update hooks for both the driver store and libdxcore directories

Tested end-to-end: nemoclaw onboard -> gateway start -> WSL2 fix -> sandbox create with GPU -> nvidia-smi working inside the sandbox pod.
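The heredoc approach might look roughly like this. The mount paths (/usr/lib/wsl/lib/libdxcore.so, /usr/lib/wsl/drivers) are standard WSL2 locations and the hook shape follows what nvidia-ctk normally generates, but the spec body below is a sketch, not the PR's actual file:

```shell
# Hypothetical sketch: write a complete CDI spec in one heredoc instead of
# sed-patching nvidia-ctk output (which duplicated containerEdits keys).
spec=/tmp/nvidia-cdi-sketch.yaml   # the real script would target /etc/cdi/nvidia.yaml

cat > "$spec" <<'EOF'
cdiVersion: "0.5.0"
kind: nvidia.com/gpu
devices:
- name: all
  containerEdits:
    deviceNodes:
    - path: /dev/dxg
containerEdits:
  mounts:
  - hostPath: /usr/lib/wsl/lib/libdxcore.so    # bridges Linux NVML to the DX GPU kernel
    containerPath: /usr/lib/wsl/lib/libdxcore.so
    options: [ro, nosuid, nodev, bind]
  - hostPath: /usr/lib/wsl/drivers             # WSL driver store libraries
    containerPath: /usr/lib/wsl/drivers
    options: [ro, nosuid, nodev, bind]
  hooks:
  - hookName: createContainer
    path: /usr/bin/nvidia-ctk
    args: [nvidia-ctk, hook, update-ldcache, --folder, /usr/lib/wsl/lib]
EOF
```

Per-GPU UUID/index entries would be appended under `devices:` after enumerating GPUs; writing the whole file in one pass is what avoids the duplicate-key corruption.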
All contributors have signed the DCO ✍️ ✅

I have read the DCO document and I hereby sign the DCO.

recheck
Thanks @tyeth. Would it be possible for you to do a fresh smoke test, since you mentioned the LLM did this after you already did some manual tuning?
💯, but not for the next couple of days; my yak-shaving missions got out of hand and now work must be the priority 😊
Would it be better to implement a holistic approach to using CDI, as outlined in #398, rather than building something one-off for WSL2?
I would definitely be supportive of a more holistic approach, especially if we have some NVIDIANs willing to do it!
100% agree. This was just going to be lost in the ether if I didn't PR it, and it may unblock some GPU-hungry souls in the meantime. More of a prod-the-bear PR to elicit an official solution.
This is a newly created issue to effectively re-open #411, now from my vouched user account.
It's probably a less-than-ideal technique and would need cleanup before real use; it's also only partially tested, as my LLM created this after I'd manually patched containers etc.
Tested on a Frame.work 16" with AMD AI 7 350 CPU (+GPU), w/ NVIDIA RTX 5070 (8GB VRAM) + 96GB DDR5 unified/shared RAM.
See the original issue in NVIDIA/NemoClaw#208 (comment)