fix(orchestrator): pause upload retain retry by jakubno · Pull Request #2993 · e2b-dev/infra

jakubno · 2026-06-12T08:47:25Z

No description provided.

orchestrator.snapshot.upload.failed counts pause-snapshot uploads that never landed durably (budget exhausted or a non-retryable error). A non-zero rate means lost snapshots.

The pause upload was fire-and-forget: a single 20-min attempt, and on any failure the snapshot was only logged and never reached storage. Since descendant builds store only diffs against their ancestors, a lost memfile is unrecoverable and cascades (later pauses fail on the ancestor-wait; resumes crash on a UFFD page-fault, object does not exist). Replace the single attempt with uploadWithRetry: retry with a fresh per-attempt timeout under a ~2h budget, exponential backoff, and default-retryable error classification (only NoDiff / object-not-exist / parent-cancel stop the loop). Re-running is safe (content-addressed storage, idempotent header swap). Also: - redisPeerKeyTTL now covers the full retry window. - uploadedBuilds is marked only on actual success, so a failed build is never advertised as uploaded to peers. - Increment the snapshot.upload.failed counter when an upload gives up. Because every pause retries, once a root build's upload lands its memfile the descendants' ancestor-waits resolve on a later attempt.

codecov · 2026-06-12T08:48:26Z

Codecov Report

❌ Patch coverage is 41.91176% with 79 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
packages/orchestrator/pkg/server/main.go	0.00%	41 Missing ⚠️
packages/orchestrator/pkg/server/sandboxes.go	0.00%	25 Missing ⚠️
packages/orchestrator/pkg/server/upload_retry.go	60.00%	8 Missing ⚠️
packages/shared/pkg/retry/retry.go	93.75%	2 Missing and 1 partial ⚠️
packages/orchestrator/pkg/factories/run.go	0.00%	2 Missing ⚠️

📢 Thoughts on this report? Let us know!

gemini-code-assist

Code Review

The background upload retry loop in uploadSnapshotAsync is detached from the request context using context.WithoutCancel(ctx), which prevents it from being cancelled when the request finishes. However, because it is completely detached, the retry loop will not abort when the server shuts down, causing the background goroutine to continue retrying for up to two hours. To prevent resource leaks during shutdown, the detached context should be linked to the server's shutdown channel s.done so that it is cancelled immediately when the server stops.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

cursor · 2026-06-12T08:50:43Z

PR Summary

Medium Risk
Changes core snapshot durability and shutdown timing; mis-tuned waits or the unretryable sync checkpoint path can still lose snapshots or prolong node drains during storage outages.

Overview
Pause snapshot uploads now retry for up to two hours with per-attempt timeouts and exponential backoff via a shared retry.Do helper, and graceful shutdown waits on in-flight uploads instead of cancelling them on the old single 20-minute timeout. Upload coordination TTLs (futureTTL, refreshHeaderBudget, Redis peer keys) are stretched to match that window so child builds and cross-orchestrator waits do not give up while a parent is still retrying; peer “already uploaded” is only set after a successful land, with a metric when uploads fail permanently.

Sync checkpoint (when async P2P checkpoint is off) still runs one upload.Run under uploadTimeout only, so it does not get the new retry behavior. Graceful drain can block shutdown for as long as uploads keep retrying (up to the total budget) unless force-stop cancels the close context.

^{Reviewed by Cursor Bugbot for commit 186614e. Bugbot is set up for automated code reviews on this repo. Configure here.}

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 1c8dbc6. Configure here.}

- Bound the retry loop to uploadTotalBudget: cap each attempt's timeout to the remaining budget so a slow attempt can't push total runtime past the budget (keeps it within redisPeerKeyTTL). - Cancel the detached upload goroutine on server shutdown so a long retry (up to 2h) doesn't outlive the process; the shutdown watcher exits when the upload finishes, so it never leaks. - Align cross-orchestrator wait budgets with the upload retry window: refreshHeaderBudget and futureTTL must be >= uploadTotalBudget, otherwise a dependent build's Wait returns a non-retryable 'object does not exist' (poll budget expiry) and gives up while the parent is still retrying.

A small reusable retry runner with a total budget, optional per-attempt timeout, and exponential backoff. The whole call runs under a single budget context (context.WithTimeoutCause), so per-attempt contexts derive from it: each attempt is capped to the remaining budget and the loop never runs past TotalBudget. Distinguishes budget exhaustion (ErrBudgetExhausted) from caller cancellation via context.Cause.

Replace the bespoke uploadWithRetry loop with the general retry.Do helper. The server keeps only the upload-specific bits: the retry policy and the isRetryableUploadErr classifier. Behavior is unchanged (fresh per-attempt timeout under a 2h budget, default-retryable classification).

Instead of cancelling the detached upload goroutine when the server shuts down, wait for in-flight uploads to finish so a graceful restart doesn't drop a snapshot that is still uploading. - Track async uploads with a WaitGroup (uploadsWG). - Server.Close now takes a context and waits on the WaitGroup, bounded by that context so a forced stop (cancelled close context) still exits promptly. The gRPC server is closed before this, so no new uploads start during the wait.

While Close waits for in-flight snapshot uploads to finish, log the remaining count at start, every 10s, and on completion (or when a forced stop cuts the wait short). Backed by an atomic in-flight counter alongside the WaitGroup.

jakubno added 2 commits June 12, 2026 08:27

feat(orchestrator): add snapshot-upload-failed metric

1877000

orchestrator.snapshot.upload.failed counts pause-snapshot uploads that never landed durably (budget exhausted or a non-retryable error). A non-zero rate means lost snapshots.

gemini-code-assist Bot reviewed Jun 12, 2026

View reviewed changes

Comment thread packages/orchestrator/pkg/server/sandboxes.go Outdated

cla-bot Bot added the cla-signed label Jun 12, 2026

cursor Bot reviewed Jun 12, 2026

View reviewed changes

Comment thread packages/orchestrator/pkg/server/sandboxes.go

Comment thread packages/orchestrator/pkg/server/upload_retry.go Outdated

jakubno added 8 commits June 12, 2026 09:47

fix: lint

8373fce

clean up

e301f91

lint

186614e

jakubno marked this pull request as ready for review June 12, 2026 11:53

jakubno requested review from ValentaTomas and dobrac as code owners June 12, 2026 11:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(orchestrator): pause upload retain retry#2993

fix(orchestrator): pause upload retain retry#2993
jakubno wants to merge 10 commits into
mainfrom
fix/orchestrator-pause-upload-retain-retry

jakubno commented Jun 12, 2026

Uh oh!

codecov Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

cursor Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jakubno commented Jun 12, 2026

Uh oh!

codecov Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

cursor Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

codecov Bot commented Jun 12, 2026 •

edited

Loading

cursor Bot commented Jun 12, 2026 •

edited

Loading