fix(orchestrator): pause upload retain retry#2993
Conversation
orchestrator.snapshot.upload.failed counts pause-snapshot uploads that never landed durably (budget exhausted or a non-retryable error). A non-zero rate means lost snapshots.
The pause upload was fire-and-forget: a single 20-min attempt, and on any failure the snapshot was only logged and never reached storage. Since descendant builds store only diffs against their ancestors, a lost memfile is unrecoverable and cascades (later pauses fail on the ancestor-wait; resumes crash on a UFFD page-fault, object does not exist). Replace the single attempt with uploadWithRetry: retry with a fresh per-attempt timeout under a ~2h budget, exponential backoff, and default-retryable error classification (only NoDiff / object-not-exist / parent-cancel stop the loop). Re-running is safe (content-addressed storage, idempotent header swap). Also: - redisPeerKeyTTL now covers the full retry window. - uploadedBuilds is marked only on actual success, so a failed build is never advertised as uploaded to peers. - Increment the snapshot.upload.failed counter when an upload gives up. Because every pause retries, once a root build's upload lands its memfile the descendants' ancestor-waits resolve on a later attempt.
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Code Review
The background upload retry loop in uploadSnapshotAsync is detached from the request context using context.WithoutCancel(ctx), which prevents it from being cancelled when the request finishes. However, because it is completely detached, the retry loop will not abort when the server shuts down, causing the background goroutine to continue retrying for up to two hours. To prevent resource leaks during shutdown, the detached context should be linked to the server's shutdown channel s.done so that it is cancelled immediately when the server stops.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
PR SummaryMedium Risk Overview Sync checkpoint (when async P2P checkpoint is off) still runs one Reviewed by Cursor Bugbot for commit 186614e. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 1c8dbc6. Configure here.
- Bound the retry loop to uploadTotalBudget: cap each attempt's timeout to the remaining budget so a slow attempt can't push total runtime past the budget (keeps it within redisPeerKeyTTL). - Cancel the detached upload goroutine on server shutdown so a long retry (up to 2h) doesn't outlive the process; the shutdown watcher exits when the upload finishes, so it never leaks. - Align cross-orchestrator wait budgets with the upload retry window: refreshHeaderBudget and futureTTL must be >= uploadTotalBudget, otherwise a dependent build's Wait returns a non-retryable 'object does not exist' (poll budget expiry) and gives up while the parent is still retrying.
A small reusable retry runner with a total budget, optional per-attempt timeout, and exponential backoff. The whole call runs under a single budget context (context.WithTimeoutCause), so per-attempt contexts derive from it: each attempt is capped to the remaining budget and the loop never runs past TotalBudget. Distinguishes budget exhaustion (ErrBudgetExhausted) from caller cancellation via context.Cause.
Replace the bespoke uploadWithRetry loop with the general retry.Do helper. The server keeps only the upload-specific bits: the retry policy and the isRetryableUploadErr classifier. Behavior is unchanged (fresh per-attempt timeout under a 2h budget, default-retryable classification).
Instead of cancelling the detached upload goroutine when the server shuts down, wait for in-flight uploads to finish so a graceful restart doesn't drop a snapshot that is still uploading. - Track async uploads with a WaitGroup (uploadsWG). - Server.Close now takes a context and waits on the WaitGroup, bounded by that context so a forced stop (cancelled close context) still exits promptly. The gRPC server is closed before this, so no new uploads start during the wait.
While Close waits for in-flight snapshot uploads to finish, log the remaining count at start, every 10s, and on completion (or when a forced stop cuts the wait short). Backed by an atomic in-flight counter alongside the WaitGroup.

No description provided.