
Reset RunWorkload retry counter after stable run #5172

Open

gkatz2 wants to merge 1 commit into stacklok:main from gkatz2:fix/runworkload-retry-counter-reset

Conversation

gkatz2 (Contributor) commented May 3, 2026

Summary

When the Docker daemon is repeatedly unresponsive, RunWorkload's 10-attempt retry budget can be exhausted by short-lived but successful starts: each container start succeeds, runs for tens of seconds, is killed by the next disruption, and counts as one of the 10 attempts. After ~10 cycles the manager logs "failed to restart after max attempts, giving up" and the workload stays offline until manually restarted, even though the original cause was transient.

This PR tracks the duration of each runner.Run call and resets the retry counter when the workload ran past a stable threshold (30s) before failing. The retry budget continues to bound tight crash loops where the workload can never get healthy.
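
A minimal sketch of the mechanism, assuming the retryConfig bundle described under Changes below; retryRun and errContainerExitedRestartNeeded are illustrative stand-ins, not the actual code in pkg/workloads/manager.go:

```go
package workloads

import (
	"context"
	"errors"
	"time"
)

// Stand-in for the real sentinel error named in this PR.
var errContainerExitedRestartNeeded = errors.New("container exited, restart needed")

// retryConfig bundles the retry parameters; the defaults match the
// previously hard-coded production values quoted below.
type retryConfig struct {
	maxRetries         int
	initialDelay       time.Duration
	maxBackoff         time.Duration
	stableRunThreshold time.Duration
}

func defaultRetryConfig() retryConfig {
	return retryConfig{
		maxRetries:         10,
		initialDelay:       5 * time.Second,
		maxBackoff:         60 * time.Second,
		stableRunThreshold: 30 * time.Second,
	}
}

// retryRun is a hypothetical simplification of the RunWorkload retry
// loop with the stable-run reset applied.
func retryRun(ctx context.Context, cfg retryConfig, run func(context.Context) error) error {
	backoff := cfg.initialDelay
	for attempt := 1; attempt <= cfg.maxRetries; attempt++ {
		start := time.Now()
		err := run(ctx)
		runDuration := time.Since(start)

		if err == nil {
			return nil // clean exit, nothing to restart
		}
		if !errors.Is(err, errContainerExitedRestartNeeded) {
			return err // non-restartable error: bail out immediately
		}
		if runDuration >= cfg.stableRunThreshold {
			// The workload was healthy before the disruption killed it:
			// start a fresh retry sequence instead of burning budget.
			attempt = 0 // the for-clause attempt++ lands the next iteration at 1
			backoff = cfg.initialDelay
			continue
		}
		time.Sleep(backoff)
		backoff *= 2
		if backoff > cfg.maxBackoff {
			backoff = cfg.maxBackoff
		}
	}
	return errors.New("failed to restart after max attempts, giving up")
}
```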

I have observed this on five stdio MCP servers (atlassian, gitlab, google-workspace, pagerduty, slack) in my local environment over two months. The bug report has the production log evidence.

Fixes #5171

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Changes

  • pkg/workloads/manager.go:
    • Track runDuration; if it is ≥ the stable threshold when Run returns ErrContainerExitedRestartNeeded, reset the attempt counter and backoff.
    • Bundle the retry parameters into a retryConfig struct; production callers leave the field nil and get defaults matching the previous hard-coded values (maxRetries=10, initialDelay=5s, maxBackoff=60s, stableRunThreshold=30s).
    • Add an mcpRunner interface + factory field on DefaultManager so the retry loop can be exercised against a fake runner that returns ErrContainerExitedRestartNeeded or success after a controlled duration, without needing a real container runtime (see the sketch after this list).
    • Differentiate the WorkloadStatusStarting reason string between stable-run failures and tight-crash-loop failures.
  • pkg/workloads/manager_test.go — 7 table-driven cases covering: all short runs exhaust the budget; a single stable run extends the budget; a mid-sequence stable run resets; the boundary case (duration == threshold); stable run then success; a non-restart error bails immediately; immediate success.
  • pkg/workloads/mocks/mock_manager.go — regenerated. Mockgen directive switched from source mode to reflect mode so only the public Manager interface is mocked; the internal mcpRunner test seam stays private. Auto-generated header line updated accordingly.
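
A sketch of what the test seam could look like; mcpRunner and DefaultManager are the names quoted above, while the scripted fakeRunner and its Run signature are purely illustrative:

```go
// mcpRunner is the private interface the retry loop runs against;
// tests substitute a fake via the factory field on DefaultManager.
type mcpRunner interface {
	Run(ctx context.Context) error
}

// fakeRun scripts one invocation: how long it "runs" and what it returns.
type fakeRun struct {
	runFor time.Duration
	err    error
}

// fakeRunner plays back a scripted sequence of results so the retry
// loop can be exercised without a real container runtime.
type fakeRunner struct {
	script []fakeRun
	calls  int
}

func (f *fakeRunner) Run(ctx context.Context) error {
	step := f.script[f.calls]
	f.calls++
	select {
	case <-time.After(step.runFor):
		return step.err
	case <-ctx.Done():
		return ctx.Err()
	}
}
```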

Test plan

  • Unit tests pass (task test); a minimal sketch of one such table-driven case follows this list.
  • task lint-fix reports 0 issues.
  • Manual end-to-end reproduction. To keep iteration fast I rebuilt the binary with shortened retry constants (maxRetries=5, initialDelay=2s, maxBackoff=5s, stableRunThreshold=15s) and restored production values before committing.
    • Bug repro without the fix: ran for i in {1..10}; do docker kill <container>; sleep 7; done against a stdio test server. Each container lifetime was 5–10 seconds — under the 15s shortened threshold. The manager incremented the attempt counter on every cycle and gave up with "failed to restart after max attempts, giving up" at attempt 5/5.
    • Fix verified for stable runs: ran sleep 18; for i in {1..10}; do docker kill <container>; sleep 20; done. Each container ran 20–25 seconds, over the 15s shortened threshold. The attempt counter stayed pinned at 1 across all 10 cycles.
    • Safety check: re-ran the 7s kill loop with the fix applied. The manager still hit attempt 5/5 and gave up — the threshold-based reset does not enable runaway loops on workloads that never get healthy.
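
As referenced above, here is a minimal sketch of one table-driven boundary case, built on the hypothetical retryRun and fakeRunner sketches earlier (in manager_test.go, with "testing" imported); the real 7-case table differs:

```go
func TestStableRunResetsRetryBudget(t *testing.T) {
	// Two short crashes, one stable run, one more short crash, then
	// success. With maxRetries=3 this can only succeed if the stable
	// run resets the counter; otherwise the budget runs out first.
	cfg := retryConfig{
		maxRetries:         3,
		initialDelay:       time.Millisecond,
		maxBackoff:         2 * time.Millisecond,
		stableRunThreshold: 50 * time.Millisecond,
	}
	f := &fakeRunner{script: []fakeRun{
		{0, errContainerExitedRestartNeeded},
		{0, errContainerExitedRestartNeeded},
		{60 * time.Millisecond, errContainerExitedRestartNeeded}, // stable run
		{0, errContainerExitedRestartNeeded},
		{0, nil},
	}}
	if err := retryRun(context.Background(), cfg, f.Run); err != nil {
		t.Fatalf("expected success after stable-run reset, got %v", err)
	}
}
```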

Special notes for reviewers

The attempt = 0 assignment in the reset path relies on the for-clause's attempt++ running on continue, landing the next iteration at attempt = 1. There is an inline comment explaining this; it's a small wart but the alternatives (a skipSleep flag, restructuring the loop) seemed worse. Open to feedback.
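
For reference, Go guarantees that continue advances control through the for-clause's post statement before the next condition check, so the reset does land the loop back at attempt = 1. A self-contained illustration:

```go
package main

import "fmt"

func main() {
	reset := false
	for attempt := 1; attempt <= 3; attempt++ {
		if attempt == 2 && !reset {
			reset = true
			attempt = 0
			continue // attempt++ still runs: next iteration sees attempt == 1
		}
		fmt.Println(attempt) // prints 1, 1, 2, 3
	}
}
```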

Generated with Claude Code

When the Docker daemon is repeatedly unresponsive, the RunWorkload
retry loop's 10-attempt budget can be exhausted by short-lived but
successful starts: the container exits, the manager retries, the new
container runs healthily for tens of seconds before the next
disruption kills it, and the cycle repeats. After 10 cycles the
manager logs "failed to restart after max attempts, giving up" and
the workload stays offline until manually restarted, even though
the original cause was transient.

Track the duration of each runner.Run call. If the workload ran past
a stable threshold before failing, treat the next attempt as a fresh
failure rather than a continuation of the current retry sequence:
reset the attempt counter and backoff delay. This preserves the
budget's role as a bound on tight crash loops while allowing
recovery from sustained external disturbances.

Add a small testability seam (runner factory on DefaultManager,
defaulting to runner.NewRunner) so the retry loop has unit-test
coverage of the boundary cases.

Fixes stacklok#5171

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Greg Katz <gkatz@indeed.com>
github-actions bot added the size/S (Small PR: 100-299 lines changed) label on May 3, 2026
codecov bot commented May 3, 2026

Codecov Report

❌ Patch coverage is 94.59459% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.57%. Comparing base (8c90184) to head (d513fd0).

Files with missing lines      Patch %    Lines
pkg/workloads/manager.go      94.59%     1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5172      +/-   ##
==========================================
+ Coverage   67.53%   67.57%   +0.03%     
==========================================
  Files         601      601              
  Lines       61093    61123      +30     
==========================================
+ Hits        41262    41303      +41     
+ Misses      16714    16703      -11     
  Partials     3117     3117              



Development

Successfully merging this pull request may close these issues.

Stdio MCP server gives up after sustained Docker daemon disruption
