feat(bootstrap): resume gateway from existing state and persist SSH handshake secret #488

Open
drew wants to merge 5 commits into main from 487-gateway-resume-ssh-secret/drew

drew (Collaborator) commented Mar 19, 2026

Summary

Add gateway resume from existing Docker volume state and persist the SSH handshake HMAC secret as a Kubernetes Secret, so openshell gateway start recovers gracefully after Docker restarts without losing sandboxes or breaking SSH sessions.

Related Issue

Closes #487

Changes

Gateway Resume

  • Add DeployOptions.resume flag with a resume branch in deploy_gateway_with_logs that falls through to idempotent ensure_* calls instead of erroring or destroying
  • gateway_admin_deploy auto-resumes for stopped/volume-only states; already-running returns immediately; --recreate still destroys
  • Auto-bootstrap (sandbox create) tries resume first, falls back to recreate on failure (logged at warn)
  • Add cleanup_gateway_container for volume-preserving cleanup on resume failure
  • Add unless-stopped Docker restart policy so the container auto-restarts on Docker daemon restart
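
For reviewers, the decision logic described in the bullets above can be sketched roughly as follows. This is an illustration only, not the code in this PR: GatewayState, DeployAction, plan_deploy and resume_or_recreate are hypothetical names, and treating "nothing exists" as a fresh deploy even with --recreate is an assumption.

```rust
/// Hypothetical summary of what the CLI finds in Docker.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GatewayState {
    Running,
    Stopped,    // container exists but is not running
    VolumeOnly, // container is gone, volume remains
    Absent,     // neither container nor volume exists
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DeployAction {
    AlreadyRunning, // return immediately
    Resume,         // reuse the volume, run the idempotent ensure_* steps
    Recreate,       // destroy and redeploy
    FreshDeploy,    // nothing to reuse or destroy (assumption)
}

/// Pick an action for an explicit `gateway start`.
fn plan_deploy(state: GatewayState, recreate: bool) -> DeployAction {
    match (state, recreate) {
        (GatewayState::Absent, _) => DeployAction::FreshDeploy,
        (_, true) => DeployAction::Recreate,
        (GatewayState::Running, false) => DeployAction::AlreadyRunning,
        (GatewayState::Stopped | GatewayState::VolumeOnly, false) => DeployAction::Resume,
    }
}

/// Auto-bootstrap style fallback: try resume, warn and recreate on failure.
fn resume_or_recreate<T, E: std::fmt::Display>(
    resume: impl FnOnce() -> Result<T, E>,
    recreate: impl FnOnce() -> Result<T, E>,
) -> Result<T, E> {
    match resume() {
        Ok(value) => Ok(value),
        Err(err) => {
            eprintln!("warn: resume failed ({err}); falling back to recreate");
            recreate()
        }
    }
}
```

Keeping the decision in a pure function like plan_deploy makes the state table easy to unit test without a Docker daemon.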

SSH Handshake Secret Persistence

  • Add reconcile_ssh_handshake_secret in bootstrap — checks if K8s secret exists, reuses if present, generates new if missing (same pattern as TLS PKI reconciliation)
  • Update Helm chart StatefulSet to read OPENSHELL_SSH_HANDSHAKE_SECRET via secretKeyRef instead of plain value
  • Remove secret generation and sed injection from cluster-entrypoint.sh
  • Remove sshHandshakeSecret from HelmChart CR values; add sshHandshakeSecretName to values.yaml
  • Update cluster-deploy-fast.sh to create K8s secret directly via kubectl
  • Add SSH handshake secret existence to cluster health check
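
The reuse-or-generate pattern named in the first bullet can be sketched as below. This is a minimal sketch under stated assumptions: SecretStore, InMemoryStore and reconcile_secret are hypothetical stand-ins (the real code talks to the Kubernetes API), and the caller supplies the secret generator because the standard library has no random source.

```rust
use std::collections::HashMap;

/// Whether reconciliation reused an existing secret or created one.
#[derive(Debug, PartialEq, Eq)]
enum Reconciled {
    Reused,
    Generated,
}

/// Hypothetical abstraction over wherever the secret lives.
trait SecretStore {
    fn get(&self, name: &str) -> Option<Vec<u8>>;
    fn put(&mut self, name: &str, value: Vec<u8>);
}

/// In-memory stand-in for the cluster, for illustration and tests only.
#[derive(Default)]
struct InMemoryStore(HashMap<String, Vec<u8>>);

impl SecretStore for InMemoryStore {
    fn get(&self, name: &str) -> Option<Vec<u8>> {
        self.0.get(name).cloned()
    }
    fn put(&mut self, name: &str, value: Vec<u8>) {
        self.0.insert(name.to_string(), value);
    }
}

/// Reuse the secret if it already exists; otherwise generate and store one.
fn reconcile_secret(
    store: &mut dyn SecretStore,
    name: &str,
    generate: impl FnOnce() -> Vec<u8>,
) -> (Vec<u8>, Reconciled) {
    if let Some(existing) = store.get(name) {
        return (existing, Reconciled::Reused);
    }
    let fresh = generate();
    store.put(name, fresh.clone());
    (fresh, Reconciled::Generated)
}
```

Calling reconcile_secret twice with the same name returns the same bytes both times, which is the property that lets SSH sessions survive a container restart.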

Testing

  • mise run pre-commit passes (format, lint, license headers)
  • cargo test --package openshell-bootstrap --package openshell-cli — all 163 tests pass
  • E2E tests (mise run e2e) — requires running cluster; these changes affect sandbox lifecycle and should be validated with a running gateway

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

…andshake secret

Add a resume code path to gateway start so existing Docker volume state
(k3s, etcd, sandboxes, secrets) is reused instead of requiring a full
destroy/recreate cycle. When the container is gone but the volume remains
(e.g. Docker restart), the CLI automatically creates a new container with
the existing volume and reconciles PKI and secrets.

Move the SSH handshake HMAC secret from ephemeral generation in the
cluster entrypoint (regenerated on every container start) to a Kubernetes
Secret that persists in etcd on the Docker volume. This ensures sandbox
SSH sessions survive container restarts.

Key changes:
- Add DeployOptions.resume flag with resume branch in deploy flow
- Add cleanup_gateway_container for volume-preserving failure cleanup
- Auto-resume in gateway_admin_deploy (stopped/volume-only states)
- Auto-bootstrap tries resume first, falls back to recreate
- Add unless-stopped Docker restart policy to gateway container
- Reconcile SSH handshake secret as K8s Secret alongside TLS PKI
- Update Helm chart to read secret via secretKeyRef
- Add SSH handshake secret to cluster health check

Closes #487

drew requested a review from a team as a code owner March 19, 2026 21:59
drew added the area:gateway (Gateway server and control-plane work) and area:cluster (Related to running OpenShell on k3s/docker) labels Mar 19, 2026
drew self-assigned this Mar 19, 2026
drew added the test:e2e (Requires end-to-end coverage) label Mar 19, 2026
johntmyers previously approved these changes Mar 19, 2026
drew added 3 commits March 19, 2026 22:24
On resume after container kill, ensure_network destroys and recreates
the Docker network with a new ID. The stopped container still referenced
the old network ID, causing 'network not found' on start. Fix by
reconciling the container's network attachment in ensure_container.
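
A rough sketch of the attachment check described above, assuming the container's attached network ID and the current network ID are both available as strings. NetworkAction and plan_network_attachment are hypothetical names, not the ones used in ensure_container.

```rust
/// What to do with the container's network attachment before starting it.
#[derive(Debug, PartialEq, Eq)]
enum NetworkAction {
    /// Already attached to the current network.
    Keep,
    /// Attached to a network ID that no longer exists: disconnect, then connect.
    Reattach { stale_id: String, current_id: String },
    /// Not attached to anything yet.
    Attach { current_id: String },
}

/// Compare the ID recorded on the container with the ID of the
/// (possibly recreated) network and decide how to reconcile them.
fn plan_network_attachment(attached_id: Option<&str>, current_id: &str) -> NetworkAction {
    match attached_id {
        Some(id) if id == current_id => NetworkAction::Keep,
        Some(id) => NetworkAction::Reattach {
            stale_id: id.to_string(),
            current_id: current_id.to_string(),
        },
        None => NetworkAction::Attach {
            current_id: current_id.to_string(),
        },
    }
}
```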

Also, reconcile_pki was attempting to load K8s secrets before k3s had
booted, failing transiently, and regenerating PKI unnecessarily. This
triggered a server rollout restart causing TLS errors. Fix by waiting
for the openshell namespace before attempting to read existing secrets.
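
The wait-before-read step can be pictured as a generic poll-with-deadline helper. This is a sketch, not the PR's implementation: wait_until is a hypothetical name, and the real check (whether the openshell namespace exists) would be passed in as the closure.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll `ready` until it returns true or `timeout` elapses.
/// Returns true if the condition was met, false on timeout.
fn wait_until(
    mut ready: impl FnMut() -> bool,
    timeout: Duration,
    interval: Duration,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if ready() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(interval);
    }
}
```

Only after this returns true would the caller try to load existing secrets, so a cluster that is still booting is not mistaken for one with missing PKI.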

Add gRPC readiness check to gateway_admin_deploy so the CLI waits for
the server to accept connections before declaring the gateway ready.
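
As a rough stand-in for the readiness check, the sketch below probes a TCP port until a connection succeeds. It is weaker than a real gRPC check, since it only shows that something is listening, and wait_for_port with its timeout values is a hypothetical helper rather than the function added in this commit.

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Return true once a TCP connection to `addr` succeeds, false if
/// `timeout` elapses first.
fn wait_for_port(addr: SocketAddr, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        // Bound each attempt so one hung connect cannot exhaust the budget.
        if TcpStream::connect_timeout(&addr, Duration::from_millis(200)).is_ok() {
            return true;
        }
        sleep(Duration::from_millis(50));
    }
    false
}
```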

Add e2e test covering container kill, stale network, sandbox persistence,
and sandbox create after resume.

The wait_for_healthy helper checked for 'healthy', 'running', or '✓'
but openshell status outputs 'Connected'. All five gateway_resume tests
were failing because the health check never matched.
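
A minimal sketch of the marker matching involved, assuming the fix adds 'Connected' to the accepted markers (the commit message does not say whether the old markers were kept). looks_healthy and HEALTHY_MARKERS are hypothetical names, and the sample status strings in the comments are made up.

```rust
/// Substrings that count as a healthy gateway in status output.
/// Keeping the three older markers alongside "Connected" is an assumption.
const HEALTHY_MARKERS: [&str; 4] = ["Connected", "healthy", "running", "✓"];

/// Case-sensitive substring match against the known markers.
fn looks_healthy(status_output: &str) -> bool {
    HEALTHY_MARKERS
        .iter()
        .any(|marker| status_output.contains(marker))
}
```

Substring matching is fragile: "unhealthy" contains "healthy", and "Disconnected" is rejected only because the match is case-sensitive, so matching a whole status field would be more robust.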
…ternally

The deploy flow now auto-detects whether to resume by checking for
existing gateway state inside deploy_gateway_with_logs. Callers no
longer need to compute and pass a resume flag. The explicit gateway
start path still short-circuits for already-running gateways to avoid
redundant work.
Labels

area:cluster: Related to running OpenShell on k3s/docker
area:gateway: Gateway server and control-plane work
test:e2e: Requires end-to-end coverage

Development

Successfully merging this pull request may close these issues.

feat: gateway resume from existing state and persistent SSH handshake secret

2 participants