docs: add SRE troubleshooting playbooks #2334

Open
martinraumann wants to merge 3 commits into NVIDIA:main from martinraumann:docs/sre-troubleshooting-playbooks

@martinraumann (Contributor)

Summary

  • Adds SRE-oriented troubleshooting playbooks for NICo stuck objects and production operations.
  • Splits the SRE troubleshooting guide into focused pages for state machine debugging, diagnostic tools, DPU provisioning, host ingestion, health alerts, connectivity, instance/fabric issues, and site-controller health.
  • Wires the new pages into the existing Operations (Day 2) -> Playbooks -> Stuck Objects docs navigation.

Jira

  • FORGE-8239

Notes

  • Based on David Mateer's SRE troubleshooting material.
  • Command examples were scrubbed against the current NICo CLI where possible.
  • Historical carbide_* names are kept only where they are current metric names.

Validation

  • npx fern-api@5.40.1 check passed with 0 errors.
  • npx fern-api@5.40.1 check --warnings shows 2 existing repo/config warnings: unauthenticated redirects check skipped, and accent color contrast ratio.
  • git diff --check passed.

FORGE-8239

Co-authored-by: David Mateer <dmateer@nvidia.com>

@copy-pr-bot (Bot) commented Jun 9, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@martinraumann marked this pull request as ready for review on June 9, 2026, 16:29
@martinraumann requested a review from Coco-Ben as a code owner on June 9, 2026, 16:29

@coderabbitai (Bot, Contributor) commented Jun 9, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e4adc6d9-44b3-46c0-97ae-2aff06e49cc4
