
bug: GPU passthrough fails on WSL2 — NVML init fails without CDI mode and libdxcore.so #404

@tyeth-ai-assisted

Description

OpenShell gateway with --gpu fails to make GPUs available to sandboxes when running on WSL2. The nvidia-device-plugin DaemonSet either never schedules (0/0 replicas) or crashes with Failed to initialize NVML: Not Supported.

Environment

  • OS: Windows WSL2 (Linux 6.6.87.2-microsoft-standard-WSL2)
  • GPU: NVIDIA GeForce RTX 5070 Laptop GPU (8GB) — dual GPU device (integrated + discrete)
  • Driver: NVIDIA 595.71, CUDA 13.2
  • Docker: 29.1.3
  • OpenShell: 0.0.7
  • k3s: v1.35.2+k3s1
  • NVIDIA Device Plugin: v0.18.2

Steps to Reproduce

  1. Run OpenShell on a WSL2 host with an NVIDIA GPU
  2. openshell gateway start --gpu
  3. Observe nvidia-device-plugin DaemonSet has 0/0 desired replicas
  4. Manually label node with feature.node.kubernetes.io/pci-10de.present=true
  5. Device plugin pod starts but crashes: Failed to initialize NVML: Not Supported

Root Cause

Three cascading issues on WSL2:

1. NFD cannot detect NVIDIA PCI device

WSL2 does not expose PCI topology to the guest kernel. NFD only sees pci-1414.present (Microsoft Hyper-V), never pci-10de.present (NVIDIA). The device plugin DaemonSet's node affinity is never satisfied.
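This can be confirmed directly from sysfs, which is where NFD's PCI detection reads vendor IDs. The helper below is a hypothetical sketch of that scan (not NFD's actual code); on WSL2 the devices directory contains only virtual Hyper-V entries (vendor 0x1414), so no 0x10de entry ever appears:

```shell
# Hypothetical helper mirroring NFD's PCI vendor scan. NFD labels a node
# with pci-10de.present only if some device under /sys/bus/pci/devices
# reports vendor 0x10de (NVIDIA). The root directory is parameterized
# purely for illustration/testing.
has_nvidia_pci() {
  root="${1:-/sys/bus/pci/devices}"
  grep -qs '0x10de' "$root"/*/vendor
}
```

On a WSL2 host, `has_nvidia_pci || echo "no NVIDIA PCI device visible"` prints the message, which is why the DaemonSet's node affinity is never satisfied without a manual label.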

2. nvidia runtime auto mode uses legacy injection

The nvidia container runtime defaults to mode = "auto", which selects legacy injection via nvidia-container-cli. On WSL2, this path fails to inject libdxcore.so into pods — the critical library that bridges Linux NVML to the Windows DirectX GPU kernel via /dev/dxg.

3. nvidia-ctk cdi generate misses libdxcore.so

When generating the CDI spec, nvidia-ctk correctly auto-detects WSL mode and selects /dev/dxg, but logs "Could not locate libdxcore.so" despite the library being present at /usr/lib/x86_64-linux-gnu/libdxcore.so. This is an upstream nvidia-container-toolkit bug (filed separately).
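The mismatch can be checked with a one-liner. The helper below is a hypothetical sketch reproducing the manual check behind this report: it succeeds when the library exists on disk but the generated spec never references it, i.e. the spec still needs patching:

```shell
# Hypothetical check: library present on the host, yet absent from the
# generated CDI spec. Returns success when the spec needs patching.
cdi_spec_missing_lib() {
  spec="$1"; lib="$2"
  [ -e "$lib" ] && ! grep -q "$(basename "$lib")" "$spec"
}
```

Example: `cdi_spec_missing_lib /var/run/cdi/nvidia.yaml /usr/lib/x86_64-linux-gnu/libdxcore.so && echo "spec needs patching"`.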

Verified Fix

All three changes are required:

# 1. Generate CDI spec and patch in libdxcore.so
openshell doctor exec -- nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml
# Then add libdxcore.so mount to /var/run/cdi/nvidia.yaml

# 2. Switch nvidia runtime to CDI mode
openshell doctor exec -- sed -i 's/mode = "auto"/mode = "cdi"/' /etc/nvidia-container-runtime/config.toml

# 3. Label node for DaemonSet scheduling
openshell doctor exec -- kubectl label node <node> feature.node.kubernetes.io/pci-10de.present=true
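The manual patch in step 1 can be sketched as a small idempotent function. This is an assumption-laden illustration, not the verified commands above: the function name is hypothetical, and the YAML indentation must match the containerEdits.mounts list that nvidia-ctk actually emits in the generated spec:

```shell
# Hypothetical sketch of step 1's manual patch: append a bind mount for
# libdxcore.so to the generated CDI spec unless it is already listed.
patch_cdi_spec() {
  spec="$1"
  lib=/usr/lib/x86_64-linux-gnu/libdxcore.so
  # Idempotent: skip if the spec already mentions the library.
  grep -q libdxcore.so "$spec" && return 0
  cat >> "$spec" <<EOF
    - hostPath: $lib
      containerPath: $lib
      options: ["ro", "nosuid", "nodev", "bind"]
EOF
}
```

Usage: `patch_cdi_spec /var/run/cdi/nvidia.yaml`, after nvidia-ctk has generated the spec.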

After these changes:

  • nvidia-device-plugin: 1/1 Running
  • gpu-feature-discovery: 1/1 Running
  • Node capacity: nvidia.com/gpu: 1
  • nvidia-smi works inside pods, RTX 5070 fully accessible

Suggested Permanent Fix

cluster-entrypoint.sh should detect WSL2 and apply these steps automatically:

  1. Detect WSL2 (/dev/dxg exists or kernel version contains WSL2)
  2. Run nvidia-ctk cdi generate (already auto-detects WSL mode)
  3. Patch the CDI spec to include libdxcore.so
  4. Set mode = "cdi" in /etc/nvidia-container-runtime/config.toml
  5. Label the node with pci-10de.present=true
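Step 1's detection could look like the sketch below. This is a hypothetical fragment for cluster-entrypoint.sh, not existing OpenShell code; the kernel release is accepted as an optional argument purely so the check is testable:

```shell
# Hypothetical WSL2 check for cluster-entrypoint.sh: treat the host as
# WSL2 when /dev/dxg exists or the kernel release string mentions WSL2
# (e.g. 6.6.87.2-microsoft-standard-WSL2 from this report).
is_wsl2() {
  kernel="${1:-$(uname -r)}"
  [ -e /dev/dxg ] && return 0
  case "$kernel" in
    *WSL2*) return 0 ;;
  esac
  return 1
}
```

The entrypoint would then gate steps 2-5 behind `if is_wsl2; then ...; fi` so non-WSL2 hosts keep the current behavior.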

This aligns with #398 (migrating to CDI for GPU injection) — WSL2 is a concrete platform where the legacy runtime stack is broken and CDI is the only viable path.

Agent Investigation

Diagnosed using openshell doctor commands:

  • openshell doctor check — passed (Docker OK)
  • openshell doctor exec -- kubectl get daemonset -n nvidia-device-plugin — 0/0 desired
  • openshell doctor exec -- kubectl -n nvidia-device-plugin logs <pod> — Failed to initialize NVML: Not Supported
  • openshell doctor exec -- nvidia-ctk cdi generate — auto-detected WSL, warned about missing libdxcore.so
  • openshell doctor exec -- nvidia-container-cli info — successfully detected GPU at gateway level
  • Verified fix by patching CDI spec, switching runtime mode, and re-labeling node


Labels

  • os:windows — Bug affects Windows hosts
  • state:triage-needed — Opened without agent diagnostics and needs triage
  • topic:compatibility — Compatibility-related work
