bug: gateway start times out before PKI generation on first run #433

@riyadhctg

Agent Diagnostic

  • Loaded debug-openshell-cluster skill from .agents/skills/
  • Ran openshell status → "tls handshake eof" (server not running)
  • Ran openshell doctor logs --lines 50 → orphaned cgroup cleanup, no errors
  • Ran openshell doctor exec -- kubectl get namespaces → openshell namespace exists
  • Ran openshell doctor exec -- kubectl -n openshell get secrets → TLS secrets
    missing
  • Ran openshell doctor exec -- kubectl -n openshell describe pod openshell-0:
    • FailedMount: secret "openshell-server-tls" not found
    • FailedMount: secret "openshell-server-client-ca" not found
  • Root cause: CLI timed out at wait_for_namespace() before reaching
    reconcile_pki() step
  • K3s was still initializing (~2 min on first run with image pulls)
  • Workaround: manually generated PKI with openssl, applied secrets, ran openshell gateway add --local
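The workaround in the last bullet can be sketched roughly as follows. This is an illustration, not the exact commands that were run and not what reconcile_pki() produces: the subject names, SANs, key size, validity period, the secret data key `ca.crt`, and the `KUBECTL` override variable are all assumptions. Only the secret names and the namespace come from the pod events above. How `kubectl` reaches the cluster (for example through `openshell doctor exec`) is also left open, since the generated files live on the host.

```sh
# gen_pki DIR: create a throwaway CA and a server certificate signed by it.
# All subject names and SANs below are hypothetical placeholders.
gen_pki() {
    dir=$1
    mkdir -p "$dir" &&
    # Self-signed CA certificate and key.
    openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
        -subj "/CN=openshell-local-ca" \
        -keyout "$dir/ca.key" -out "$dir/ca.crt" 2>/dev/null &&
    # Server key and certificate signing request.
    openssl req -newkey rsa:2048 -nodes \
        -subj "/CN=openshell-server" \
        -keyout "$dir/server.key" -out "$dir/server.csr" 2>/dev/null &&
    # SANs are a guess for a local gateway.
    printf 'subjectAltName=DNS:localhost,IP:127.0.0.1\n' > "$dir/san.ext" &&
    # Sign the server CSR with the CA.
    openssl x509 -req -in "$dir/server.csr" -days 365 \
        -CA "$dir/ca.crt" -CAkey "$dir/ca.key" -CAcreateserial \
        -extfile "$dir/san.ext" -out "$dir/server.crt" 2>/dev/null
}

# apply_secrets DIR: create the two secrets named in the FailedMount events.
# The data key "ca.crt" is an assumption about what the pod expects.
apply_secrets() {
    dir=$1
    ${KUBECTL:-kubectl} -n openshell create secret tls openshell-server-tls \
        --cert="$dir/server.crt" --key="$dir/server.key" &&
    ${KUBECTL:-kubectl} -n openshell create secret generic openshell-server-client-ca \
        --from-file=ca.crt="$dir/ca.crt"
}
```

Running `gen_pki ./pki` followed by `apply_secrets ./pki` would correspond to the "generated PKI, applied secrets" part of the workaround; `openshell gateway add --local` is then run as described in the bullet.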

Description

On first run, openshell gateway start times out waiting for the openshell namespace
(~120s) while K3s is still initializing. The CLI exits before reaching the PKI
generation step in reconcile_pki().

The container keeps running and K3s eventually creates the namespace, but without
TLS secrets the openshell-0 pod is stuck in ContainerCreating with FailedMount
errors. The gateway never becomes healthy.

Expected: Gateway starts successfully with TLS secrets created.
Actual: Timeout at namespace wait, PKI step skipped, cluster left in broken state.
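To check the timing theory independently of the CLI, one option is to poll for the namespace with a longer deadline than the ~120s the CLI allows. The helper below is a hypothetical sketch, not part of OpenShell; the 600-second budget in the usage comment is an arbitrary choice.

```sh
# wait_for TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND...
# Re-runs COMMAND until it succeeds (returns 0) or the deadline passes (returns 1).
wait_for() {
    timeout=$1
    interval=$2
    shift 2
    deadline=$(( $(date +%s) + timeout ))
    until "$@" >/dev/null 2>&1; do
        # Give up once the deadline has been reached.
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep "$interval"
    done
    return 0
}

# Usage (after `openshell gateway start` has timed out):
#   wait_for 600 5 openshell doctor exec -- kubectl get namespace openshell \
#       && echo "namespace appeared" || echo "still missing after 10 minutes"
```

If the namespace shows up a short time after the CLI gave up, that supports the reading that the fixed wait in wait_for_namespace() is too short for a first run with image pulls.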

Reproduction Steps

  1. Fresh install: curl -LsSf https://raw.githubusercontent.com/NVIDIA/OpenShell/main/install.sh | sh
  2. Run openshell gateway start
  3. Observe timeout error: "K8s namespace not ready"
  4. Container keeps running but openshell status shows "tls handshake eof"
  5. openshell doctor exec -- kubectl -n openshell get secrets shows no TLS secrets
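Step 5 can be narrowed to the two secrets named in the FailedMount events. The function below is a small hypothetical helper: it takes the lookup command as its arguments and appends each secret name to it, so it does not assume anything about the CLI beyond the `kubectl -n openshell get secret` form.

```sh
# missing_secrets LOOKUP_COMMAND...
# Runs LOOKUP_COMMAND <name> for each expected secret, prints the ones that
# are not found, and returns 1 if any is missing.
missing_secrets() {
    missing=0
    for name in openshell-server-tls openshell-server-client-ca; do
        if ! "$@" "$name" >/dev/null 2>&1; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   missing_secrets openshell doctor exec -- kubectl -n openshell get secret
```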

Environment

  • OS: macOS (Apple Silicon, Darwin 25.3.0)
  • Docker: Colima + Docker Engine 27.4.0 (4 CPUs, 8GB RAM)
  • OpenShell: 0.0.11-dev.2+g1d071b8d9

Logs

Deploying local gateway openshell...
    Checking Docker
    Downloading gateway
    Initializing environment
  Error:   × K8s namespace not ready
    ╰─▶ timed out waiting for namespace 'openshell' to exist: Error from server (NotFound):
        namespaces "openshell" not found

  # After timeout, pod status:
  $ openshell doctor exec -- kubectl -n openshell describe pod openshell-0
  Events:
    Warning  FailedMount  kubelet  MountVolume.SetUp failed for volume "tls-client-ca" : secret "openshell-server-client-ca" not found
    Warning  FailedMount  kubelet  MountVolume.SetUp failed for volume "tls-cert" : secret "openshell-server-tls" not found

Agent-First Checklist

  • I pointed my agent at the repo and had it investigate this issue
  • I loaded relevant skills (e.g., debug-openshell-cluster, debug-inference, openshell-cli)
  • My agent could not resolve this — the diagnostic above explains why

Labels: bug, needs-agent-triage
