feat(orchestrator): drain sandboxes during shutdown by wj-e2b · Pull Request #3005 · e2b-dev/infra

wj-e2b · 2026-06-13T02:16:44Z

Introduce a shared draingate.Gate (counter plus notification channel) and use it in the sandbox factory and gRPC server to reject new sandbox starts while draining, wait for in-flight starts, and drain or force-stop live sandboxes before closers run. Forced shutdown preserves buffered close errors on context cancellation and avoids duplicate final-pass errors. Adds utils.WaitGroupWait to wait on a WaitGroup with context cancellation.

The graceful drain phase is bounded by SHUTDOWN_DRAIN_TIMEOUT; when it expires the drain escalates to a forced sandbox shutdown. By default the drain waits forever, until sandboxes exit on their own or a force-stop API call empties the node.

Introduce a shared draingate.Gate (counter plus notification channel) and use it in the sandbox factory and gRPC server to reject new sandbox starts while draining, wait for in-flight starts, and drain or force-stop live sandboxes before closers run. Forced shutdown preserves buffered close errors on context cancellation and avoids duplicate final-pass errors. Adds utils.WaitGroupWait to wait on a WaitGroup with context cancellation. The graceful drain phase is bounded by SHUTDOWN_DRAIN_TIMEOUT; when it expires the drain escalates to a forced sandbox shutdown. By default the drain waits forever, until sandboxes exit on their own or a force-stop API call empties the node.

cursor · 2026-06-13T02:16:50Z

PR Summary

Medium Risk
Changes core orchestrator shutdown and sandbox admission paths; incorrect drain ordering or force-stop behavior could strand VMs or reject valid traffic during deploys, though behavior is heavily covered by new tests.

Overview
Orchestrator shutdown now runs an explicit sandbox drain before other closers: the node marks draining, rejects new create/checkpoint starts with gRPC Unavailable, and uses a shared draingate on the gRPC server and sandbox factory to wait for in-flight starts in a defined order (server gate first, then factory, so checkpoint resume is not broken).

Graceful shutdown polls until live sandboxes reach zero and waits on lifecycle cleanup; SHUTDOWN_DRAIN_TIMEOUT optionally caps that phase and triggers a timed force stop that closes remaining VMs concurrently. FORCE_STOP skips graceful drain. Minor shutdown polish includes ignoring EINVAL on logger sync and treating expected service-done errors as non-fatal.

^{Reviewed by Cursor Bugbot for commit 5d6b2e7. Bugbot is set up for automated code reviews on this repo. Configure here.}

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 5d6b2e7. Configure here.}

cursor · 2026-06-13T02:17:49Z


+	if err := s.rejectIfDraining(ctx, "sandbox-create-before-start"); err != nil {
+		return nil, err
+	}


Draining checks abort in-flight starts

Medium Severity

Handlers that already acquired the server drain gate via enterSandboxStart can still fail when rejectIfDraining or waitForAcquire runs after shutdown calls StartDraining. That rejects work the drain gate was meant to let finish, including long create and checkpoint paths before ResumeSandbox.

Additional Locations (1)

packages/orchestrator/pkg/server/utils.go#L67-L71

^{Reviewed by Cursor Bugbot for commit 5d6b2e7. Configure here.}

gemini-code-assist

Code Review

Compilation errors exist in packages/orchestrator/pkg/server/main.go and packages/orchestrator/pkg/draingate/gate_test.go where wg.Go() is called on a sync.WaitGroup which does not have a Go method. In packages/orchestrator/pkg/server/main.go, calling context.WithoutCancel(ctx) strips the configured shutdown deadline, which can cause the orchestrator to hang indefinitely if a sandbox cleanup hangs. Additionally, the spin-loop using runtime.Gosched() in packages/shared/pkg/utils/waitgroup.go is flaky and can cause premature context cancellation errors under heavy CPU load.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-13T02:18:26Z

+			wg.Go(func() {
+				sbxLog := logger.L().With(
+					logger.WithSandboxID(sbx.Runtime.SandboxID),
+					logger.WithLifecycleID(sbx.LifecycleID),
+					logger.WithSandboxIP(sbx.Slot.HostIPString()),
+				)
+				sbxLog.Warn(cleanupCtx, "force stopping sandbox during orchestrator shutdown")
+
+				marked := s.sandboxFactory.Sandboxes.MarkStopping(cleanupCtx, sbx.Runtime.SandboxID, sbx.LifecycleID)
+				if !marked {
+					sbxLog.Info(cleanupCtx, "sandbox was already removed from live map before force stop")
+				}
+
+				if err := sbx.Close(cleanupCtx); err != nil {
+					errCh <- fmt.Errorf("close sandbox %s/%s: %w", sbx.Runtime.SandboxID, sbx.LifecycleID, err)
+					sbxLog.Error(cleanupCtx, "failed to force close sandbox", zap.Error(err))
+				}
+
+				sbxLog.Info(cleanupCtx, "forced sandbox cleanup complete")
+			})


The sync.WaitGroup type does not have a Go method, which will cause a compilation error. To run the cleanup concurrently, you should use standard wg.Add(1) before the goroutine and defer wg.Done() inside it.

wg.Add(1) go func() { defer wg.Done() sbxLog := logger.L().With( logger.WithSandboxID(sbx.Runtime.SandboxID), logger.WithLifecycleID(sbx.LifecycleID), logger.WithSandboxIP(sbx.Slot.HostIPString()), ) sbxLog.Warn(cleanupCtx, "force stopping sandbox during orchestrator shutdown") marked := s.sandboxFactory.Sandboxes.MarkStopping(cleanupCtx, sbx.Runtime.SandboxID, sbx.LifecycleID) if !marked { sbxLog.Info(cleanupCtx, "sandbox was already removed from live map before force stop") } if err := sbx.Close(cleanupCtx); err != nil { errCh <- fmt.Errorf("close sandbox %s/%s: %w", sbx.Runtime.SandboxID, sbx.LifecycleID, err) sbxLog.Error(cleanupCtx, "failed to force close sandbox", zap.Error(err)) } sbxLog.Info(cleanupCtx, "forced sandbox cleanup complete") }()

gemini-code-assist · 2026-06-13T02:18:26Z

+	for range 100 {
+		wg.Go(func() {
+			<-start
+			release, err := g.Enter()
+			if err != nil {
+				rejected <- err
+
+				return
+			}
+
+			entered <- release
+		})
+	}


The sync.WaitGroup type does not have a Go method, which will cause a compilation error in this test. You should use standard wg.Add(1) and defer wg.Done() to manage the concurrent test goroutines.

Suggested change

for range 100 {

wg.Go(func() {

<-start

release, err := g.Enter()

if err != nil {

rejected <- err

return

}

entered <- release

})

}

for range 100 {

wg.Add(1)

go func() {

defer wg.Done()

<-start

release, err := g.Enter()

if err != nil {

rejected <- err

return

}

entered <- release

}()

}

gemini-code-assist · 2026-06-13T02:18:26Z

+	stopped := make(map[string]struct{})
+	var errs []error
+
+	cleanupCtx := context.WithoutCancel(ctx)


Calling context.WithoutCancel(ctx) here strips the 30-second forceShutdownTimeout deadline that was explicitly configured during shutdown. This causes the subsequent sandbox close operations to run without any timeout, meaning the orchestrator can hang indefinitely if a sandbox cleanup hangs. You should use the passed ctx directly to ensure the force-shutdown phase is properly bounded.

Suggested change

cleanupCtx := context.WithoutCancel(ctx)

cleanupCtx := ctx

gemini-code-assist · 2026-06-13T02:18:26Z

+	for range 10 {
+		select {
+		case <-done:
+			return nil
+		default:
+			runtime.Gosched()
+		}
+	}
+
+	select {
+	case <-ctx.Done():
+		return fmt.Errorf("waiting for wait group: %w", ctx.Err())
+	case <-done:
+		return nil
+	}


The spin-loop using runtime.Gosched() is flaky and does not guarantee that the goroutine closing done will be scheduled and executed within 10 iterations. Under heavy CPU load or thread starvation, the loop can easily fall through to the select block and return a context error even if the WaitGroup was already completed, making the associated test fragile. You should use a standard select block to wait for either the WaitGroup completion or context cancellation.

select { case <-done: return nil case <-ctx.Done(): return fmt.Errorf("waiting for wait group: %w", ctx.Err()) }

codecov · 2026-06-13T02:18:27Z

Codecov Report

❌ Patch coverage is 55.42725% with 193 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
packages/orchestrator/pkg/server/main.go	45.87%	96 Missing and 9 partials ⚠️
packages/orchestrator/pkg/factories/run.go	10.86%	41 Missing ⚠️
packages/orchestrator/pkg/sandbox/sandbox.go	56.09%	17 Missing and 1 partial ⚠️
packages/orchestrator/pkg/server/sandboxes.go	0.00%	15 Missing ⚠️
packages/orchestrator/pkg/server/utils.go	66.66%	12 Missing ⚠️
packages/orchestrator/pkg/draingate/gate.go	97.61%	2 Missing ⚠️

📢 Thoughts on this report? Let us know!

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5d6b2e7976

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-13T02:18:51Z

 	defer func(g *errgroup.Group) {
 		err := g.Wait()
-		if err != nil {
+		if err != nil && !isServiceDoneError(err) {


Preserve service failures when ignoring shutdown sentinels

When a service exits because f() returned an actual error (for example a non-benign grpcServer.Serve or cmuxServer.Serve failure), startService sends that error on serviceError but then returns only serviceDoneError; the serviceError select branch only logs it and never flips success to false. With this new guard, the deferred g.Wait() now also ignores the only error value it can see, so Run can report a clean shutdown and remove the lock file after a real service failure. Either preserve the original service error in the errgroup result or mark success = false when the serviceError branch receives a non-nil error.

Useful? React with 👍 / 👎.

wj-e2b requested review from ValentaTomas, dobrac and jakubno as code owners June 13, 2026 02:16

cla-bot Bot added the cla-signed label Jun 13, 2026

cursor Bot reviewed Jun 13, 2026

View reviewed changes

gemini-code-assist Bot reviewed Jun 13, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 13, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(orchestrator): drain sandboxes during shutdown#3005

feat(orchestrator): drain sandboxes during shutdown#3005
wj-e2b wants to merge 1 commit into
mainfrom
wj-orch-fix-4

wj-e2b commented Jun 13, 2026

Uh oh!

cursor Bot commented Jun 13, 2026 •

edited

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot Jun 13, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Uh oh!

codecov Bot commented Jun 13, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

wj-e2b commented Jun 13, 2026

Uh oh!

cursor Bot commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jun 13, 2026

Choose a reason for hiding this comment

Draining checks abort in-flight starts

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 13, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cursor Bot commented Jun 13, 2026 •

edited

Loading

codecov Bot commented Jun 13, 2026 •

edited

Loading