Description
OpenShell gateway with --gpu fails to make GPUs available to sandboxes when running on WSL2. The nvidia-device-plugin DaemonSet either never schedules (0/0 replicas) or crashes with Failed to initialize NVML: Not Supported.
Environment
- OS: Windows WSL2 (Linux 6.6.87.2-microsoft-standard-WSL2)
- GPU: NVIDIA GeForce RTX 5070 Laptop GPU (8GB) — dual GPU device (integrated + discrete)
- Driver: NVIDIA 595.71, CUDA 13.2
- Docker: 29.1.3
- OpenShell: 0.0.7
- k3s: v1.35.2+k3s1
- NVIDIA Device Plugin: v0.18.2
Steps to Reproduce
- Run OpenShell on a WSL2 host with an NVIDIA GPU: `openshell gateway start --gpu`
- Observe that the `nvidia-device-plugin` DaemonSet has 0/0 desired replicas
- Manually label the node with `feature.node.kubernetes.io/pci-10de.present=true`
- Device plugin pod starts but crashes: `Failed to initialize NVML: Not Supported`
Root Cause
Three cascading issues on WSL2:
1. NFD cannot detect NVIDIA PCI device
WSL2 does not expose PCI topology to the guest kernel. NFD only sees pci-1414.present (Microsoft Hyper-V), never pci-10de.present (NVIDIA). The device plugin DaemonSet's node affinity is never satisfied.
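This is easy to confirm from the gateway: NFD's pci source scans `/sys/bus/pci/devices`, which WSL2 leaves empty. A minimal diagnostic sketch (the `nfd_pci_label` helper below is a hypothetical illustration of NFD's label format, not an NFD API):

```shell
# On WSL2 this directory is empty or absent, so NFD's pci source has nothing to scan:
ls /sys/bus/pci/devices 2>/dev/null || true

# Illustration of the label NFD publishes for a PCI vendor it *does* see:
nfd_pci_label() {
  printf 'feature.node.kubernetes.io/pci-%s.present=true\n' "$1"
}

nfd_pci_label 1414   # Microsoft Hyper-V -- the only vendor visible on WSL2
nfd_pci_label 10de   # NVIDIA -- required by the device plugin's node affinity, never set
```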
2. nvidia runtime auto mode uses legacy injection
The nvidia container runtime defaults to mode = "auto" which selects legacy injection via nvidia-container-cli. On WSL2, this path does not properly inject libdxcore.so into pods — the critical library that bridges Linux NVML to the Windows DirectX GPU Kernel via /dev/dxg.
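For reference, the relevant section of `/etc/nvidia-container-runtime/config.toml` after the fix (excerpt; surrounding keys omitted):

```toml
[nvidia-container-runtime]
# "auto" resolves to legacy nvidia-container-cli injection on WSL2,
# which fails to inject libdxcore.so; force CDI instead.
mode = "cdi"
```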
3. nvidia-ctk cdi generate misses libdxcore.so
When generating the CDI spec, nvidia-ctk correctly auto-detects WSL mode and selects /dev/dxg, but logs "Could not locate libdxcore.so" despite the library being present at /usr/lib/x86_64-linux-gnu/libdxcore.so. This is an upstream nvidia-container-toolkit bug (filed separately).
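A hedged sketch of the manual patch: a CDI `containerEdits` mount entry for `libdxcore.so` (the exact structure depends on the generated spec; the paths are the ones from this report):

```yaml
# Added to /var/run/cdi/nvidia.yaml under containerEdits.mounts
containerEdits:
  mounts:
    - hostPath: /usr/lib/x86_64-linux-gnu/libdxcore.so
      containerPath: /usr/lib/x86_64-linux-gnu/libdxcore.so
      options: ["ro", "nosuid", "nodev", "bind"]
```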
Verified Fix
All three changes are required:
```sh
# 1. Generate CDI spec and patch in libdxcore.so
openshell doctor exec -- nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml
# Then add a libdxcore.so mount to /var/run/cdi/nvidia.yaml

# 2. Switch the nvidia runtime to CDI mode
openshell doctor exec -- sed -i 's/mode = "auto"/mode = "cdi"/' /etc/nvidia-container-runtime/config.toml

# 3. Label the node for DaemonSet scheduling
openshell doctor exec -- kubectl label node <node> feature.node.kubernetes.io/pci-10de.present=true
```

After these changes:

- `nvidia-device-plugin`: 1/1 Running
- `gpu-feature-discovery`: 1/1 Running
- Node capacity: `nvidia.com/gpu: 1`
- `nvidia-smi` works inside pods; the RTX 5070 is fully accessible
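The node-capacity check can be scripted. The helper below is a hypothetical illustration (not part of OpenShell) that pulls the `nvidia.com/gpu` count out of `kubectl describe node` output piped on stdin:

```shell
# Hypothetical helper: extract the nvidia.com/gpu capacity value.
gpu_capacity() {
  awk '/nvidia\.com\/gpu:/ { print $2; exit }'
}

# Example with canned input mimicking the node's Capacity section;
# on the live cluster you would pipe `kubectl describe node <node>` instead.
printf 'Capacity:\n  nvidia.com/gpu:  1\n' | gpu_capacity
```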
Suggested Permanent Fix
`cluster-entrypoint.sh` should detect WSL2 and apply these steps automatically:
- Detect WSL2 (`/dev/dxg` exists or kernel version contains `WSL2`)
- Run `nvidia-ctk cdi generate` (already auto-detects WSL mode)
- Patch the CDI spec to include `libdxcore.so`
- Set `mode = "cdi"` in `/etc/nvidia-container-runtime/config.toml`
- Label the node with `pci-10de.present=true`
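A minimal sketch of what that detection and auto-apply could look like (function names and the `--apply` guard are assumptions; the CDI-spec patch step is elided because it depends on the generated file):

```shell
#!/bin/sh
# Sketch only -- the commands mirror the verified manual fix above.

# Pure helper: does a kernel release string look like WSL2?
kernel_is_wsl2() {
  case "$1" in *WSL2*) echo yes ;; *) echo no ;; esac
}

detect_wsl2() {
  # WSL2 exposes the DirectX paravirtual device /dev/dxg
  [ -e /dev/dxg ] && return 0
  [ "$(kernel_is_wsl2 "$(uname -r)")" = yes ]
}

apply_wsl2_gpu_fixups() {
  # 1. Generate the CDI spec (nvidia-ctk auto-detects WSL mode)
  nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml
  # 2. Patch the spec to mount libdxcore.so (elided; works around the upstream bug)
  # 3. Force CDI mode instead of legacy injection
  sed -i 's/mode = "auto"/mode = "cdi"/' /etc/nvidia-container-runtime/config.toml
  # 4. Label the node so the device-plugin DaemonSet schedules
  kubectl label node "$(hostname)" feature.node.kubernetes.io/pci-10de.present=true --overwrite
}

# Only act when explicitly asked, and only on WSL2
if [ "${1:-}" = "--apply" ]; then
  detect_wsl2 && apply_wsl2_gpu_fixups
fi
```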
This aligns with #398 (migrating to CDI for GPU injection) — WSL2 is a concrete platform where the legacy runtime stack is broken and CDI is the only viable path.
Agent Investigation
Diagnosed using openshell doctor commands:
- `openshell doctor check` — passed (Docker OK)
- `openshell doctor exec -- kubectl get daemonset -n nvidia-device-plugin` — 0/0 desired
- `openshell doctor exec -- kubectl -n nvidia-device-plugin logs <pod>` — `NVML: Not Supported`
- `openshell doctor exec -- nvidia-ctk cdi generate` — auto-detected WSL, warned about missing libdxcore.so
- `openshell doctor exec -- nvidia-container-cli info` — successfully detected the GPU at the gateway level
- Verified the fix by patching the CDI spec, switching the runtime mode, and re-labeling the node