Reset RunWorkload retry counter after stable run #5172
Open
gkatz2 wants to merge 1 commit into stacklok:main
Conversation
When the Docker daemon is repeatedly unresponsive, the RunWorkload retry loop's 10-attempt budget can be exhausted by short-lived but successful starts: the container exits, the manager retries, the new container runs healthily for tens of seconds before the next disruption kills it, and the cycle repeats. After 10 cycles the manager logs "failed to restart after max attempts, giving up" and the workload stays offline until manually restarted, even though the original cause was transient.

Track the duration of each runner.Run call. If the workload ran past a stable threshold before failing, treat the next attempt as a fresh failure rather than a continuation of the current retry sequence: reset the attempt counter and backoff delay. This preserves the budget's role as a bound on tight crash loops while allowing recovery from sustained external disturbances.

Add a small testability seam (a runner factory on DefaultManager, defaulting to runner.NewRunner) so the retry loop has unit-test coverage of the boundary cases.

Fixes stacklok#5171

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Greg Katz <gkatz@indeed.com>
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
##             main    #5172      +/-   ##
==========================================
+ Coverage   67.53%   67.57%   +0.03%
==========================================
  Files         601      601
  Lines       61093    61123      +30
==========================================
+ Hits        41262    41303      +41
+ Misses      16714    16703      -11
  Partials     3117     3117

☔ View full report in Codecov by Sentry.
Summary
When the Docker daemon is repeatedly unresponsive, RunWorkload's 10-attempt retry budget can be exhausted by short-lived but successful starts: each container start succeeds, runs for tens of seconds, gets killed by the next disruption, and counts as one of the 10 attempts. After ~10 cycles the manager logs "failed to restart after max attempts, giving up" and the workload stays offline until manually restarted, even though the original cause was transient.

This PR tracks the duration of each runner.Run call and resets the retry counter when the workload ran past a stable threshold (30s) before failing. The retry budget continues to bound tight crash loops where the workload can never get healthy.

I have observed this on five stdio MCP servers (atlassian, gitlab, google-workspace, pagerduty, slack) in my local environment over two months. The bug report has the production log evidence.
Fixes #5171
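To make the control flow concrete, here is a minimal sketch of the idea. Only the threshold/backoff values and the ErrContainerExitedRestartNeeded sentinel come from the description above; the function shape and names (runWithRetries, the bare run callback) are simplified stand-ins, not the actual pkg/workloads/manager.go code.

```go
// Minimal sketch of the retry loop with a stable-run reset (not the actual
// manager code). Only the values and the sentinel error name follow the PR.
package workloads

import (
	"context"
	"errors"
	"time"
)

var ErrContainerExitedRestartNeeded = errors.New("container exited, restart needed")

const (
	maxRetries         = 10
	initialDelay       = 5 * time.Second
	maxBackoff         = 60 * time.Second
	stableRunThreshold = 30 * time.Second
)

// runWithRetries keeps restarting the workload. Short-lived failures consume
// the attempt budget; a run that stayed up past stableRunThreshold resets it.
func runWithRetries(ctx context.Context, run func(context.Context) error) error {
	delay := initialDelay
	for attempt := 1; attempt <= maxRetries; attempt++ {
		start := time.Now()
		err := run(ctx)
		if err == nil {
			return nil // workload finished cleanly
		}
		if !errors.Is(err, ErrContainerExitedRestartNeeded) {
			return err // non-restartable error: bail out immediately
		}
		if time.Since(start) >= stableRunThreshold {
			// The workload was stable before this failure: start a fresh
			// retry sequence. The for-clause attempt++ runs on continue,
			// so the next iteration sees attempt == 1.
			attempt = 0
			delay = initialDelay
			continue
		}
		time.Sleep(delay) // back off before the next tight-loop attempt
		delay *= 2
		if delay > maxBackoff {
			delay = maxBackoff
		}
	}
	return errors.New("failed to restart after max attempts, giving up")
}
```

The real loop also updates the workload status reason and builds the runner through the factory on DefaultManager; only the counter and backoff handling is shown here.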
Type of change
Changes
- pkg/workloads/manager.go:
  - Track runDuration; if it is ≥ the stable threshold on an ErrContainerExitedRestartNeeded, reset the attempt counter and backoff.
  - New retryConfig struct; production callers leave the field nil and get defaults matching the previous hard-coded values (maxRetries=10, initialDelay=5s, maxBackoff=60s, stableRunThreshold=30s).
  - New mcpRunner interface + factory field on DefaultManager so the retry loop can be exercised against a fake runner that returns ErrContainerExitedRestartNeeded or success after a controlled duration, without needing a real container runtime (see the sketch after this list).
  - Differentiate the WorkloadStatusStarting reason string between stable-run failures and tight-crash-loop failures.
- pkg/workloads/manager_test.go — 7 table-driven cases covering: all-short-runs exhaust budget; single stable run extends budget; mid-sequence stable resets; boundary == threshold; stable-then-success; non-restart error bails immediately; immediate success (an illustrative stand-in sketch follows the test plan below).
- pkg/workloads/mocks/mock_manager.go — regenerated. Mockgen directive switched from source mode to reflect mode so only the public Manager interface is mocked; the internal mcpRunner test seam stays private. Auto-generated header line updated accordingly.
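Roughly, the configuration and test seam described above could look like the following sketch. The type and field names (retryConfig, mcpRunner, DefaultManager) follow the change list, but the layout and the factory signature are illustrative guesses, not the real pkg/workloads/manager.go definitions.

```go
// Hypothetical shapes for the retryConfig knobs and the mcpRunner test seam;
// field layout and the factory signature are illustrative only.
package workloads

import (
	"context"
	"time"
)

// retryConfig holds the values the retry loop previously hard-coded.
// A nil *retryConfig on the manager falls back to these defaults.
type retryConfig struct {
	maxRetries         int
	initialDelay       time.Duration
	maxBackoff         time.Duration
	stableRunThreshold time.Duration
}

func defaultRetryConfig() *retryConfig {
	return &retryConfig{
		maxRetries:         10,
		initialDelay:       5 * time.Second,
		maxBackoff:         60 * time.Second,
		stableRunThreshold: 30 * time.Second,
	}
}

// mcpRunner is the minimal surface the retry loop needs, so tests can inject
// a fake that fails with ErrContainerExitedRestartNeeded or succeeds after a
// controlled duration.
type mcpRunner interface {
	Run(ctx context.Context) error
}

// DefaultManager carries the config and a runner factory; production leaves
// retry nil (defaults) and newRunner pointing at the real runner constructor.
type DefaultManager struct {
	retry     *retryConfig
	newRunner func(workloadName string) (mcpRunner, error) // signature illustrative
}
```

Keeping the seam as a package-private interface plus a factory field means production code paths are unchanged while tests swap in fakes without a container runtime.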
Test plan

- Unit tests pass (task test).
- task lint-fix reports 0 issues.
- Manual testing used temporarily shortened retry values (maxRetries=5, initialDelay=2s, maxBackoff=5s, stableRunThreshold=15s) and restored production values before committing.
- Tight crash loop: docker kill <container>; sleep 7 in a 10-iteration loop against a stdio test server. Each container lifetime was 5–10 seconds, under the 15s shortened threshold. The manager incremented the attempt counter on every cycle and gave up with "failed to restart after max attempts, giving up" at attempt 5/5.
- Stable runs: sleep 18; for i in 1..10; do docker kill <container>; sleep 20; done. Each container ran 20–25 seconds (over the 15s shortened threshold). The attempt counter pinned at 1 across all 10 cycles.
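As a complement to the manual runs above, here is a self-contained, illustrative example of the table-driven style used by the new unit tests. The loop under test is a stripped-down stand-in (the real cases exercise DefaultManager through the mcpRunner factory), and the case names and durations here are invented, not copied from pkg/workloads/manager_test.go.

```go
// Illustrative table-driven test; retryLoop is a stand-in for the manager's
// loop: run reports how long the workload "ran" and whether it failed, and a
// run at or above stableRunThreshold resets the attempt budget.
package workloads

import (
	"errors"
	"testing"
	"time"
)

var errRestartNeeded = errors.New("container exited, restart needed")

func retryLoop(maxRetries int, stableRunThreshold time.Duration,
	run func() (time.Duration, error)) error {
	for attempt := 1; attempt <= maxRetries; attempt++ {
		ran, err := run()
		if err == nil {
			return nil
		}
		if !errors.Is(err, errRestartNeeded) {
			return err
		}
		if ran >= stableRunThreshold {
			attempt = 0 // fresh budget after a stable run
		}
	}
	return errors.New("failed to restart after max attempts")
}

func TestRetryLoopStableRunReset(t *testing.T) {
	cases := []struct {
		name     string
		failRuns []time.Duration // simulated duration of each failed run; success afterwards
		wantErr  bool
	}{
		{"all short runs exhaust budget", []time.Duration{2 * time.Second, 2 * time.Second, 2 * time.Second}, true},
		{"stable run mid-sequence resets counter", []time.Duration{2 * time.Second, 40 * time.Second, 2 * time.Second}, false},
		{"immediate success", nil, false},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			i := 0
			err := retryLoop(3, 30*time.Second, func() (time.Duration, error) {
				if i < len(tc.failRuns) {
					d := tc.failRuns[i]
					i++
					return d, errRestartNeeded
				}
				return time.Minute, nil
			})
			if (err != nil) != tc.wantErr {
				t.Fatalf("got err=%v, want error=%v", err, tc.wantErr)
			}
		})
	}
}
```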
Special notes for reviewers

The attempt = 0 assignment in the reset path relies on the for-clause's attempt++ running on continue, landing the next iteration at attempt = 1. There is an inline comment explaining this; it's a small wart, but the alternatives (a skipSleep flag, restructuring the loop) seemed worse. Open to feedback.
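For anyone double-checking that pattern, a tiny standalone demo of the Go semantics involved (illustrative, not project code):

```go
// Demonstrates that the for-clause post statement (attempt++) still runs on
// continue, so setting attempt = 0 before continue lands the next iteration
// at attempt == 1.
package main

import "fmt"

func main() {
	reset := true
	for attempt := 1; attempt <= 3; attempt++ {
		fmt.Println("attempt", attempt)
		if reset && attempt == 2 {
			reset = false
			attempt = 0
			continue // attempt++ runs, next iteration prints "attempt 1"
		}
	}
	// Prints: attempt 1, attempt 2, attempt 1, attempt 2, attempt 3
}
```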
Generated with Claude Code