Description
OpenShell gateway with --gpu fails to make GPUs available to sandboxes when running on WSL2. The nvidia-device-plugin DaemonSet either never schedules (0/0 replicas) or crashes with Failed to initialize NVML: Not Supported.
Environment
- OS: Windows WSL2 (Linux 6.6.87.2-microsoft-standard-WSL2)
- GPU: NVIDIA GeForce RTX 5070 Laptop GPU (8GB) — dual GPU device (integrated + discrete)
- Driver: NVIDIA 595.71, CUDA 13.2
- Docker: 29.1.3
- OpenShell: 0.0.7
- k3s: v1.35.2+k3s1
- NVIDIA Device Plugin: v0.18.2
Steps to Reproduce
- Run OpenShell on a WSL2 host with an NVIDIA GPU: `openshell gateway start --gpu`
- Observe that the `nvidia-device-plugin` DaemonSet has 0/0 desired replicas
- Manually label the node with `feature.node.kubernetes.io/pci-10de.present=true`
- Device plugin pod starts but crashes: `Failed to initialize NVML: Not Supported`
Root Cause
Three cascading issues on WSL2:
1. NFD cannot detect NVIDIA PCI device
WSL2 does not expose PCI topology to the guest kernel. NFD only sees pci-1414.present (Microsoft Hyper-V), never pci-10de.present (NVIDIA). The device plugin DaemonSet's node affinity is never satisfied.
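This is easy to confirm from the gateway: NFD's pci source scans `/sys/bus/pci/devices`, which WSL2 leaves empty. A minimal diagnostic sketch (the `nfd_pci_label` helper below is a hypothetical illustration of NFD's label format, not an NFD API):

```shell
# On WSL2 this directory is empty or absent, so NFD's pci source has nothing to scan:
ls /sys/bus/pci/devices 2>/dev/null || true

# Illustration of the label NFD publishes for a PCI vendor it *does* see:
nfd_pci_label() {
  printf 'feature.node.kubernetes.io/pci-%s.present=true\n' "$1"
}

nfd_pci_label 1414   # Microsoft Hyper-V -- the only vendor visible on WSL2
nfd_pci_label 10de   # NVIDIA -- required by the device plugin's node affinity, never set
```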
2. nvidia runtime auto mode uses legacy injection
The nvidia container runtime defaults to mode = "auto" which selects legacy injection via nvidia-container-cli. On WSL2, this path does not properly inject libdxcore.so into pods — the critical library that bridges Linux NVML to the Windows DirectX GPU Kernel via /dev/dxg.
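For reference, the relevant section of `/etc/nvidia-container-runtime/config.toml` after the fix (excerpt; surrounding keys omitted):

```toml
[nvidia-container-runtime]
# "auto" resolves to legacy nvidia-container-cli injection on WSL2,
# which fails to inject libdxcore.so; force CDI instead.
mode = "cdi"
```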
3. nvidia-ctk cdi generate misses libdxcore.so
When generating the CDI spec, nvidia-ctk correctly auto-detects WSL mode and selects /dev/dxg, but logs "Could not locate libdxcore.so" despite the library being present at /usr/lib/x86_64-linux-gnu/libdxcore.so. This is an upstream nvidia-container-toolkit bug (filed separately).
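A hedged sketch of the manual patch: a CDI `containerEdits` mount entry for `libdxcore.so` (the exact structure depends on the generated spec; the paths are the ones from this report):

```yaml
# Added to /var/run/cdi/nvidia.yaml under containerEdits.mounts
containerEdits:
  mounts:
    - hostPath: /usr/lib/x86_64-linux-gnu/libdxcore.so
      containerPath: /usr/lib/x86_64-linux-gnu/libdxcore.so
      options: ["ro", "nosuid", "nodev", "bind"]
```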
Verified Fix
All three changes are required:
```sh
# 1. Generate CDI spec and patch in libdxcore.so
openshell doctor exec -- nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml
# Then add a libdxcore.so mount to /var/run/cdi/nvidia.yaml

# 2. Switch the nvidia runtime to CDI mode
openshell doctor exec -- sed -i 's/mode = "auto"/mode = "cdi"/' /etc/nvidia-container-runtime/config.toml

# 3. Label the node for DaemonSet scheduling
openshell doctor exec -- kubectl label node <node> feature.node.kubernetes.io/pci-10de.present=true
```

After these changes:

- `nvidia-device-plugin`: 1/1 Running
- `gpu-feature-discovery`: 1/1 Running
- Node capacity: `nvidia.com/gpu: 1`
- `nvidia-smi` works inside pods; the RTX 5070 is fully accessible
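The node-capacity check can be scripted. The helper below is a hypothetical illustration (not part of OpenShell) that pulls the `nvidia.com/gpu` count out of `kubectl describe node` output piped on stdin:

```shell
# Hypothetical helper: extract the nvidia.com/gpu capacity value.
gpu_capacity() {
  awk '/nvidia\.com\/gpu:/ { print $2; exit }'
}

# Example with canned input mimicking the node's Capacity section;
# on the live cluster you would pipe `kubectl describe node <node>` instead.
printf 'Capacity:\n  nvidia.com/gpu:  1\n' | gpu_capacity
```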
Suggested Permanent Fix
`cluster-entrypoint.sh` should detect WSL2 and apply these steps automatically:
- Detect WSL2 (`/dev/dxg` exists or kernel version contains `WSL2`)
- Run `nvidia-ctk cdi generate` (already auto-detects WSL mode)
- Patch the CDI spec to include `libdxcore.so`
- Set `mode = "cdi"` in `/etc/nvidia-container-runtime/config.toml`
- Label the node with `pci-10de.present=true`
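A minimal sketch of what that detection and auto-apply could look like (function names and the `--apply` guard are assumptions; the CDI-spec patch step is elided because it depends on the generated file):

```shell
#!/bin/sh
# Sketch only -- the commands mirror the verified manual fix above.

# Pure helper: does a kernel release string look like WSL2?
kernel_is_wsl2() {
  case "$1" in *WSL2*) echo yes ;; *) echo no ;; esac
}

detect_wsl2() {
  # WSL2 exposes the DirectX paravirtual device /dev/dxg
  [ -e /dev/dxg ] && return 0
  [ "$(kernel_is_wsl2 "$(uname -r)")" = yes ]
}

apply_wsl2_gpu_fixups() {
  # 1. Generate the CDI spec (nvidia-ctk auto-detects WSL mode)
  nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml
  # 2. Patch the spec to mount libdxcore.so (elided; works around the upstream bug)
  # 3. Force CDI mode instead of legacy injection
  sed -i 's/mode = "auto"/mode = "cdi"/' /etc/nvidia-container-runtime/config.toml
  # 4. Label the node so the device-plugin DaemonSet schedules
  kubectl label node "$(hostname)" feature.node.kubernetes.io/pci-10de.present=true --overwrite
}

# Only act when explicitly asked, and only on WSL2
if [ "${1:-}" = "--apply" ]; then
  detect_wsl2 && apply_wsl2_gpu_fixups
fi
```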
This aligns with #398 (migrating to CDI for GPU injection) — WSL2 is a concrete platform where the legacy runtime stack is broken and CDI is the only viable path.
Agent Investigation
Diagnosed using openshell doctor commands:
- `openshell doctor check` — passed (Docker OK)
- `openshell doctor exec -- kubectl get daemonset -n nvidia-device-plugin` — 0/0 desired
- `openshell doctor exec -- kubectl -n nvidia-device-plugin logs <pod>` — `NVML: Not Supported`
- `openshell doctor exec -- nvidia-ctk cdi generate` — auto-detected WSL, warned about missing libdxcore.so
- `openshell doctor exec -- nvidia-container-cli info` — successfully detected the GPU at the gateway level
- Verified the fix by patching the CDI spec, switching the runtime mode, and re-labeling the node