fix(bootstrap): surface diagnostics for K8s namespace not ready failures#466

Merged
drew merged 3 commits into main from
fix-namespace-not-ready-diagnostics
Mar 19, 2026

@drew drew commented Mar 19, 2026

Summary

Users hitting "K8s namespace not ready" saw a bare error with zero recovery guidance, despite extensive diagnostic plumbing existing in the codebase. This PR closes three compounding gaps so that every failure path now surfaces actionable diagnosis.

Changes

Gap 1: Non-interactive path had no diagnosis at all

  • deploy_gateway_with_panel non-interactive path (CI, piped output) used bare .await? propagation
  • Now catches the error and runs the same diagnosis + fallback as the interactive path
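A minimal sketch of the change in shape, assuming synchronous stand-ins for the real async code. `deploy`, `diagnose`, and the `Diagnosis` type are hypothetical placeholders, not the project's actual items; only the before/after control flow reflects the description above.

```rust
/// Hypothetical stand-in for the real diagnosis type.
#[derive(Debug, PartialEq)]
struct Diagnosis {
    summary: String,
}

/// Hypothetical deploy step; it always fails here so the error path is visible.
fn deploy(name: &str) -> Result<(), String> {
    Err(format!("K8s namespace not ready for gateway '{name}'"))
}

/// Hypothetical shared handler, standing in for the diagnosis + fallback logic.
fn diagnose(name: &str, err: &str) -> Diagnosis {
    Diagnosis {
        summary: format!("gateway '{name}' failed: {err}"),
    }
}

/// Before: bare `?` propagation, so the caller only ever sees the raw error.
fn deploy_non_interactive_before(name: &str) -> Result<(), String> {
    deploy(name)?;
    Ok(())
}

/// After: catch the error, run the same diagnosis as the interactive path,
/// and return both the original error and the guidance.
fn deploy_non_interactive_after(name: &str) -> Result<(), (String, Diagnosis)> {
    deploy(name).map_err(|err| {
        let diagnosis = diagnose(name, &err);
        (err, diagnosis)
    })
}
```

In the real code the diagnosis is presumably printed rather than returned; returning it here just keeps the sketch easy to assert on.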

Gap 2: Interactive path silently dropped unmatched failures

  • diagnose_failure() returned None for the common timeout case (no pattern matched)
  • generic_failure_diagnosis() existed but was never called — there was no else branch
  • Now uses .unwrap_or_else(|| generic_failure_diagnosis(name)) so there's always guidance shown
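The fallback can be sketched as follows. The function names and the three-argument call come from this PR; the `FailureDiagnosis` fields, the parameter types, the single pattern shown, and its recovery hint are assumptions made only to keep the example self-contained.

```rust
/// Hypothetical shape of a diagnosis; the real type may carry more fields.
#[derive(Debug, Clone, PartialEq)]
struct FailureDiagnosis {
    summary: String,
    recovery_steps: Vec<String>,
}

/// Returns a specific diagnosis only when a known pattern matches.
/// A plain timeout matches nothing and yields `None`.
fn diagnose_failure(name: &str, err: &str, logs: Option<&str>) -> Option<FailureDiagnosis> {
    let haystack = format!("{err}\n{}", logs.unwrap_or(""));
    if haystack.contains("HEALTHCHECK_NODE_PRESSURE") {
        return Some(FailureDiagnosis {
            summary: format!("Gateway '{name}' is under node pressure"),
            // Placeholder hint, not the project's actual wording.
            recovery_steps: vec!["Free resources on the host, then retry".to_string()],
        });
    }
    None
}

/// Catch-all guidance: doctor commands first, destroy-and-recreate last.
fn generic_failure_diagnosis(name: &str) -> FailureDiagnosis {
    FailureDiagnosis {
        summary: format!("Gateway '{name}' failed to start"),
        recovery_steps: vec![
            "openshell doctor logs".to_string(),
            "openshell doctor check".to_string(),
            "Destroy and recreate the gateway".to_string(),
        ],
    }
}

/// The fix: never leave the user without guidance.
fn diagnose_or_generic(name: &str, err: &str, logs: Option<&str>) -> FailureDiagnosis {
    diagnose_failure(name, err, logs).unwrap_or_else(|| generic_failure_diagnosis(name))
}
```

Using `unwrap_or_else` rather than `unwrap_or` means the generic diagnosis is only built when no pattern matched.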

Gap 3: Container logs were never passed to the diagnosis engine

  • The call was diagnose_failure(name, &err_str, None) — always None
  • Patterns like extension-apiserver-authentication, HEALTHCHECK_NODE_PRESSURE, no default route present could never match unless they appeared in the miette error chain
  • Now fetches 80 lines of container logs via new fetch_gateway_logs() and passes them to the matcher
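To illustrate why the `None` argument mattered, here is a rough sketch of matching against both the error text and a log tail. `tail_lines` and `pattern_matches` are hypothetical helpers; the real `fetch_gateway_logs()` talks to the container runtime and is not reproduced here. Only the 80-line limit is taken from the PR.

```rust
/// Number of trailing log lines handed to the matcher (80 in this PR).
const LOG_TAIL_LINES: usize = 80;

/// Keep only the last `n` lines of `text`.
fn tail_lines(text: &str, n: usize) -> String {
    let lines: Vec<&str> = text.lines().collect();
    let start = lines.len().saturating_sub(n);
    lines[start..].join("\n")
}

/// Hypothetical matcher: true when `pattern` appears in the error text
/// or in the tail of the container logs, if any were supplied.
fn pattern_matches(err: &str, logs: Option<&str>, pattern: &str) -> bool {
    err.contains(pattern)
        || logs.map_or(false, |l| tail_lines(l, LOG_TAIL_LINES).contains(pattern))
}
```

With `logs` always `None`, a pattern that only appears in container output can never match, which is the behavior this gap describes.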

Additional fixes

  • Timeout error path in wait_for_namespace now includes container logs (like the DNS and crash paths already did)
  • Exec-error-on-final-attempt path now includes container logs and a descriptive message instead of a raw error
  • generic_failure_diagnosis now suggests openshell doctor logs and openshell doctor check before the destroy-and-recreate step, making existing diagnostic tooling discoverable
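The two `wait_for_namespace` error paths could look roughly like this. It is a simplified, synchronous sketch: the closure parameters, the attempt counter, and the message wording are all assumptions, and the sleep between attempts is omitted.

```rust
/// Poll `check` up to `max_attempts` times. Both failure paths attach the
/// container logs returned by `fetch_logs` to the error message.
fn wait_for_namespace<C, L>(max_attempts: u32, mut check: C, fetch_logs: L) -> Result<(), String>
where
    C: FnMut(u32) -> Result<bool, String>,
    L: Fn() -> String,
{
    for attempt in 1..=max_attempts {
        match check(attempt) {
            Ok(true) => return Ok(()),
            // Namespace not ready yet: try again.
            Ok(false) => {}
            // Exec error on the final attempt: descriptive message plus logs
            // instead of the raw error.
            Err(e) if attempt == max_attempts => {
                return Err(format!(
                    "K8s namespace not ready: exec failed on final attempt: {e}\n--- container logs ---\n{}",
                    fetch_logs()
                ));
            }
            // Earlier exec errors are treated as transient.
            Err(_) => {}
        }
    }
    // Timeout path: include logs, as the DNS and crash paths already did.
    Err(format!(
        "K8s namespace not ready: timed out after {max_attempts} attempts\n--- container logs ---\n{}",
        fetch_logs()
    ))
}
```

Because the logs are embedded in the error text, a matcher that only inspects the error chain can still see log-only patterns on these paths.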

Testing

  • mise run pre-commit passes
  • All existing unit tests pass (0 failures)
  • E2E tests (requires running cluster)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

The 'K8s namespace not ready' error had three gaps preventing diagnostic
information from reaching users:

1. The non-interactive (CI/piped) code path used bare error propagation
   with no diagnosis at all.
2. The interactive path's pattern matcher returned None for the common
   timeout case, and the generic_failure_diagnosis fallback existed but
   was never called.
3. Container logs were never passed to the diagnosis engine, so patterns
   only visible in logs (node pressure, corrupted state, etc.) could not
   match.

Fix all three by fetching container logs at the CLI error-handling site,
passing them to diagnose_failure, and falling back to
generic_failure_diagnosis when no specific pattern matches. Also add
container logs to the two wait_for_namespace error paths that were
missing them (timeout and exec-error-on-final-attempt), and update the
generic diagnosis to suggest 'openshell doctor' commands.
@drew drew self-assigned this Mar 19, 2026
@drew drew requested a review from a team as a code owner March 19, 2026 04:49
drew added 2 commits March 18, 2026 21:53
… bootstrap message

The auto-bootstrap banner told users to run 'openshell gateway status',
but 'status' is a top-level command, not a gateway subcommand. Running
the suggested command produced an 'unrecognized subcommand' error.

Cover the key behaviors introduced by the diagnostic fix:

- generic_failure_diagnosis suggests doctor logs/check commands
- Plain namespace timeout returns None from diagnose_failure (confirming
  the generic fallback is necessary)
- Container logs enable pattern matching for namespace errors that would
  otherwise go undiagnosed (node pressure, corrupted state, no route,
  network connectivity)
- End-to-end fallback pattern mirrors the actual CLI unwrap_or_else chain
@drew drew added the test:e2e Requires end-to-end coverage label Mar 19, 2026
@drew drew merged commit a4883d8 into main Mar 19, 2026
12 checks passed
@drew drew deleted the fix-namespace-not-ready-diagnostics branch March 19, 2026 16:13