
Reset RunWorkload retry counter after stable run #5172

Open

gkatz2 wants to merge 1 commit into stacklok:main from gkatz2:fix/runworkload-retry-counter-reset

Conversation

gkatz2 (Contributor) commented May 3, 2026

Summary

When the Docker daemon is repeatedly unresponsive, RunWorkload's 10-attempt retry budget can be exhausted by short-lived but successful starts: each container start succeeds, runs for tens of seconds, is killed by the next disruption, and counts as one of the 10 attempts. After ~10 cycles the manager logs "failed to restart after max attempts, giving up" and the workload stays offline until manually restarted, even though the original cause was transient.

This PR tracks the duration of each runner.Run call and resets the retry counter when the workload ran past a stable threshold (30s) before failing. The retry budget continues to bound tight crash loops where the workload can never get healthy.
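
A minimal sketch of the mechanism, assuming the retryConfig bundle described under Changes below; retryRun and errContainerExitedRestartNeeded are illustrative stand-ins, not the actual code in pkg/workloads/manager.go:

```go
package workloads

import (
	"context"
	"errors"
	"time"
)

// Stand-in for the real sentinel error named in this PR.
var errContainerExitedRestartNeeded = errors.New("container exited, restart needed")

// retryConfig bundles the retry parameters; the defaults match the
// previously hard-coded production values quoted below.
type retryConfig struct {
	maxRetries         int
	initialDelay       time.Duration
	maxBackoff         time.Duration
	stableRunThreshold time.Duration
}

func defaultRetryConfig() retryConfig {
	return retryConfig{
		maxRetries:         10,
		initialDelay:       5 * time.Second,
		maxBackoff:         60 * time.Second,
		stableRunThreshold: 30 * time.Second,
	}
}

// retryRun is a hypothetical simplification of the RunWorkload retry
// loop with the stable-run reset applied.
func retryRun(ctx context.Context, cfg retryConfig, run func(context.Context) error) error {
	backoff := cfg.initialDelay
	for attempt := 1; attempt <= cfg.maxRetries; attempt++ {
		start := time.Now()
		err := run(ctx)
		runDuration := time.Since(start)

		if err == nil {
			return nil // clean exit, nothing to restart
		}
		if !errors.Is(err, errContainerExitedRestartNeeded) {
			return err // non-restartable error: bail out immediately
		}
		if runDuration >= cfg.stableRunThreshold {
			// The workload was healthy before the disruption killed it:
			// start a fresh retry sequence instead of burning budget.
			attempt = 0 // the for-clause attempt++ lands the next iteration at 1
			backoff = cfg.initialDelay
			continue
		}
		time.Sleep(backoff)
		backoff *= 2
		if backoff > cfg.maxBackoff {
			backoff = cfg.maxBackoff
		}
	}
	return errors.New("failed to restart after max attempts, giving up")
}
```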

I have observed this on five stdio MCP servers (atlassian, gitlab, google-workspace, pagerduty, slack) in my local environment over two months. The bug report has the production log evidence.

Fixes #5171

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Changes

  • pkg/workloads/manager.go:
    • Track runDuration; if it is ≥ the stable threshold when Run returns ErrContainerExitedRestartNeeded, reset the attempt counter and backoff.
    • Bundle the retry parameters into a retryConfig struct; production callers leave the field nil and get defaults matching the previous hard-coded values (maxRetries=10, initialDelay=5s, maxBackoff=60s, stableRunThreshold=30s).
    • Add an mcpRunner interface + factory field on DefaultManager so the retry loop can be exercised against a fake runner that returns ErrContainerExitedRestartNeeded or success after a controlled duration, without needing a real container runtime (see the sketch after this list).
    • Differentiate the WorkloadStatusStarting reason string between stable-run failures and tight-crash-loop failures.
  • pkg/workloads/manager_test.go — 7 table-driven cases covering: all short runs exhaust the budget; a single stable run extends the budget; a mid-sequence stable run resets; the boundary case (duration == threshold); stable run then success; a non-restart error bails immediately; immediate success.
  • pkg/workloads/mocks/mock_manager.go — regenerated. Mockgen directive switched from source mode to reflect mode so only the public Manager interface is mocked; the internal mcpRunner test seam stays private. Auto-generated header line updated accordingly.
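
A sketch of what the test seam could look like; mcpRunner and DefaultManager are the names quoted above, while the scripted fakeRunner and its Run signature are purely illustrative:

```go
// mcpRunner is the private interface the retry loop runs against;
// tests substitute a fake via the factory field on DefaultManager.
type mcpRunner interface {
	Run(ctx context.Context) error
}

// fakeRun scripts one invocation: how long it "runs" and what it returns.
type fakeRun struct {
	runFor time.Duration
	err    error
}

// fakeRunner plays back a scripted sequence of results so the retry
// loop can be exercised without a real container runtime.
type fakeRunner struct {
	script []fakeRun
	calls  int
}

func (f *fakeRunner) Run(ctx context.Context) error {
	step := f.script[f.calls]
	f.calls++
	select {
	case <-time.After(step.runFor):
		return step.err
	case <-ctx.Done():
		return ctx.Err()
	}
}
```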

Test plan

  • Unit tests pass (task test); a minimal sketch of one such table-driven case follows this list.
  • task lint-fix reports 0 issues.
  • Manual end-to-end reproduction. To keep iteration fast I rebuilt the binary with shortened retry constants (maxRetries=5, initialDelay=2s, maxBackoff=5s, stableRunThreshold=15s) and restored production values before committing.
    • Bug repro without the fix: ran for i in {1..10}; do docker kill <container>; sleep 7; done against a stdio test server. Each container lifetime was 5–10 seconds — under the 15s shortened threshold. The manager incremented the attempt counter on every cycle and gave up with "failed to restart after max attempts, giving up" at attempt 5/5.
    • Fix verified for stable runs: ran sleep 18; for i in {1..10}; do docker kill <container>; sleep 20; done. Each container ran 20–25 seconds, over the 15s shortened threshold. The attempt counter stayed pinned at 1 across all 10 cycles.
    • Safety check: re-ran the 7s kill loop with the fix applied. The manager still hit attempt 5/5 and gave up — the threshold-based reset does not enable runaway loops on workloads that never get healthy.
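
As referenced above, here is a minimal sketch of one table-driven boundary case, built on the hypothetical retryRun and fakeRunner sketches earlier (in manager_test.go, with "testing" imported); the real 7-case table differs:

```go
func TestStableRunResetsRetryBudget(t *testing.T) {
	// Two short crashes, one stable run, one more short crash, then
	// success. With maxRetries=3 this can only succeed if the stable
	// run resets the counter; otherwise the budget runs out first.
	cfg := retryConfig{
		maxRetries:         3,
		initialDelay:       time.Millisecond,
		maxBackoff:         2 * time.Millisecond,
		stableRunThreshold: 50 * time.Millisecond,
	}
	f := &fakeRunner{script: []fakeRun{
		{0, errContainerExitedRestartNeeded},
		{0, errContainerExitedRestartNeeded},
		{60 * time.Millisecond, errContainerExitedRestartNeeded}, // stable run
		{0, errContainerExitedRestartNeeded},
		{0, nil},
	}}
	if err := retryRun(context.Background(), cfg, f.Run); err != nil {
		t.Fatalf("expected success after stable-run reset, got %v", err)
	}
}
```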

Special notes for reviewers

The attempt = 0 assignment in the reset path relies on the for-clause's attempt++ running on continue, landing the next iteration at attempt = 1. There is an inline comment explaining this; it's a small wart but the alternatives (a skipSleep flag, restructuring the loop) seemed worse. Open to feedback.
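
For reference, Go guarantees that continue advances control through the for-clause's post statement before the next condition check, so the reset does land the loop back at attempt = 1. A self-contained illustration:

```go
package main

import "fmt"

func main() {
	reset := false
	for attempt := 1; attempt <= 3; attempt++ {
		if attempt == 2 && !reset {
			reset = true
			attempt = 0
			continue // attempt++ still runs: next iteration sees attempt == 1
		}
		fmt.Println(attempt) // prints 1, 1, 2, 3
	}
}
```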

Generated with Claude Code

When the Docker daemon is repeatedly unresponsive, the RunWorkload
retry loop's 10-attempt budget can be exhausted by short-lived but
successful starts: the container exits, the manager retries, the new
container runs healthily for tens of seconds before the next
disruption kills it, and the cycle repeats. After 10 cycles the
manager logs "failed to restart after max attempts, giving up" and
the workload stays offline until manually restarted, even though
the original cause was transient.

Track the duration of each runner.Run call. If the workload ran past
a stable threshold before failing, treat the next attempt as a fresh
failure rather than a continuation of the current retry sequence:
reset the attempt counter and backoff delay. This preserves the
budget's role as a bound on tight crash loops while allowing
recovery from sustained external disturbances.

Add a small testability seam (runner factory on DefaultManager,
defaulting to runner.NewRunner) so the retry loop has unit-test
coverage of the boundary cases.

Fixes stacklok#5171

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Greg Katz <gkatz@indeed.com>
github-actions bot added the size/S (Small PR: 100-299 lines changed) label on May 3, 2026
codecov bot commented May 3, 2026

Codecov Report

❌ Patch coverage is 94.59459% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.57%. Comparing base (8c90184) to head (d513fd0).

Files with missing lines      Patch %    Lines
pkg/workloads/manager.go      94.59%     1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5172      +/-   ##
==========================================
+ Coverage   67.53%   67.57%   +0.03%     
==========================================
  Files         601      601              
  Lines       61093    61123      +30     
==========================================
+ Hits        41262    41303      +41     
+ Misses      16714    16703      -11     
  Partials     3117     3117              



Development

Successfully merging this pull request may close these issues.

Stdio MCP server gives up after sustained Docker daemon disruption
