Skip to content

fix(bootstrap): surface Helm install failure on namespace timeout (#211)#486

Open
Manoj-engineer wants to merge 1 commit intoNVIDIA:mainfrom
Manoj-engineer:fix/helm-error-diagnosis-211
Open

fix(bootstrap): surface Helm install failure on namespace timeout (#211)#486
Manoj-engineer wants to merge 1 commit intoNVIDIA:mainfrom
Manoj-engineer:fix/helm-error-diagnosis-211

Conversation

@Manoj-engineer
Copy link

Vouched at: #420

Summary

when gateway start times out waiting for the openshell namespace, the error
message now checks for failed helm-install-* jobs in kube-system and surfaces
the actual Helm error and last 30 log lines instead of the generic "namespace not ready" message.

Related Issue

Fixes #211

Changes

  • Add diagnose_helm_failure() in openshell-bootstrap/src/lib.rs that queries
    helm-install-* jobs in kube-system for failed pods and returns job conditions
    • last 30 log lines
  • Wire into wait_for_namespace() final timeout branch
  • Fix awk filter: status.failed stays <none> during backoff retry window;
    filter on != "0" instead of != "<none>" && != "0" to catch actively-failing jobs
  • Add unit test helm_failure_hint_is_included_in_namespace_timeout_message

Testing

  • mise run pre-commit passes
  • Unit tests added/updated — 78/78 pass
  • E2E tests added/updated (not applicable — error path only)
  • Live-tested end-to-end: built a gateway image with corrupted serviceaccount.yaml,
    confirmed the Helm error appears in the terminal output on timeout

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (not applicable)

…IDIA#211)

Signed-off-by: Manoj-engineer <194872717+Manoj-engineer@users.noreply.github.com>
@Manoj-engineer Manoj-engineer requested a review from a team as a code owner March 19, 2026 21:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve error message when Helm chart has malformed YAML

1 participant