Skip to content

fix(orchestrator): pause upload retain retry#2993

Open
jakubno wants to merge 10 commits into
mainfrom
fix/orchestrator-pause-upload-retain-retry
Open

fix(orchestrator): pause upload retain retry#2993
jakubno wants to merge 10 commits into
mainfrom
fix/orchestrator-pause-upload-retain-retry

Conversation

@jakubno

@jakubno jakubno commented Jun 12, 2026

Copy link
Copy Markdown
Member

No description provided.

jakubno added 2 commits June 12, 2026 08:27
orchestrator.snapshot.upload.failed counts pause-snapshot uploads that
never landed durably (budget exhausted or a non-retryable error). A
non-zero rate means lost snapshots.
The pause upload was fire-and-forget: a single 20-min attempt, and on any
failure the snapshot was only logged and never reached storage. Since
descendant builds store only diffs against their ancestors, a lost memfile
is unrecoverable and cascades (later pauses fail on the ancestor-wait;
resumes crash on a UFFD page-fault, object does not exist).

Replace the single attempt with uploadWithRetry: retry with a fresh
per-attempt timeout under a ~2h budget, exponential backoff, and
default-retryable error classification (only NoDiff / object-not-exist /
parent-cancel stop the loop). Re-running is safe (content-addressed
storage, idempotent header swap).

Also:
- redisPeerKeyTTL now covers the full retry window.
- uploadedBuilds is marked only on actual success, so a failed build is
  never advertised as uploaded to peers.
- Increment the snapshot.upload.failed counter when an upload gives up.

Because every pause retries, once a root build's upload lands its memfile
the descendants' ancestor-waits resolve on a later attempt.
@codecov

codecov Bot commented Jun 12, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 41.91176% with 79 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
packages/orchestrator/pkg/server/main.go 0.00% 41 Missing ⚠️
packages/orchestrator/pkg/server/sandboxes.go 0.00% 25 Missing ⚠️
packages/orchestrator/pkg/server/upload_retry.go 60.00% 8 Missing ⚠️
packages/shared/pkg/retry/retry.go 93.75% 2 Missing and 1 partial ⚠️
packages/orchestrator/pkg/factories/run.go 0.00% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The background upload retry loop in uploadSnapshotAsync is detached from the request context using context.WithoutCancel(ctx), which prevents it from being cancelled when the request finishes. However, because it is completely detached, the retry loop will not abort when the server shuts down, causing the background goroutine to continue retrying for up to two hours. To prevent resource leaks during shutdown, the detached context should be linked to the server's shutdown channel s.done so that it is cancelled immediately when the server stops.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread packages/orchestrator/pkg/server/sandboxes.go Outdated
@cursor

cursor Bot commented Jun 12, 2026

Copy link
Copy Markdown

PR Summary

Medium Risk
Changes core snapshot durability and shutdown timing; mis-tuned waits or the unretryable sync checkpoint path can still lose snapshots or prolong node drains during storage outages.

Overview
Pause snapshot uploads now retry for up to two hours with per-attempt timeouts and exponential backoff via a shared retry.Do helper, and graceful shutdown waits on in-flight uploads instead of cancelling them on the old single 20-minute timeout. Upload coordination TTLs (futureTTL, refreshHeaderBudget, Redis peer keys) are stretched to match that window so child builds and cross-orchestrator waits do not give up while a parent is still retrying; peer “already uploaded” is only set after a successful land, with a metric when uploads fail permanently.

Sync checkpoint (when async P2P checkpoint is off) still runs one upload.Run under uploadTimeout only, so it does not get the new retry behavior. Graceful drain can block shutdown for as long as uploads keep retrying (up to the total budget) unless force-stop cancels the close context.

Reviewed by Cursor Bugbot for commit 186614e. Bugbot is set up for automated code reviews on this repo. Configure here.

@cla-bot cla-bot Bot added the cla-signed label Jun 12, 2026

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 1c8dbc6. Configure here.

Comment thread packages/orchestrator/pkg/server/sandboxes.go
Comment thread packages/orchestrator/pkg/server/upload_retry.go Outdated
jakubno added 8 commits June 12, 2026 09:47
- Bound the retry loop to uploadTotalBudget: cap each attempt's timeout to
  the remaining budget so a slow attempt can't push total runtime past the
  budget (keeps it within redisPeerKeyTTL).
- Cancel the detached upload goroutine on server shutdown so a long retry
  (up to 2h) doesn't outlive the process; the shutdown watcher exits when
  the upload finishes, so it never leaks.
- Align cross-orchestrator wait budgets with the upload retry window:
  refreshHeaderBudget and futureTTL must be >= uploadTotalBudget, otherwise
  a dependent build's Wait returns a non-retryable 'object does not exist'
  (poll budget expiry) and gives up while the parent is still retrying.
A small reusable retry runner with a total budget, optional per-attempt
timeout, and exponential backoff. The whole call runs under a single
budget context (context.WithTimeoutCause), so per-attempt contexts derive
from it: each attempt is capped to the remaining budget and the loop never
runs past TotalBudget. Distinguishes budget exhaustion (ErrBudgetExhausted)
from caller cancellation via context.Cause.
Replace the bespoke uploadWithRetry loop with the general retry.Do helper.
The server keeps only the upload-specific bits: the retry policy and the
isRetryableUploadErr classifier. Behavior is unchanged (fresh per-attempt
timeout under a 2h budget, default-retryable classification).
Instead of cancelling the detached upload goroutine when the server shuts
down, wait for in-flight uploads to finish so a graceful restart doesn't
drop a snapshot that is still uploading.

- Track async uploads with a WaitGroup (uploadsWG).
- Server.Close now takes a context and waits on the WaitGroup, bounded by
  that context so a forced stop (cancelled close context) still exits
  promptly. The gRPC server is closed before this, so no new uploads start
  during the wait.
While Close waits for in-flight snapshot uploads to finish, log the
remaining count at start, every 10s, and on completion (or when a forced
stop cuts the wait short). Backed by an atomic in-flight counter alongside
the WaitGroup.
@jakubno jakubno marked this pull request as ready for review June 12, 2026 11:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant