fix(orchestrator): stop balloon hinting round when pre-pause drain fails#3001

Draft
ValentaTomas wants to merge 1 commit into
mainfrom
fix/fph-drain-stop-on-failure

@ValentaTomas

Investigation context (EN-978)

While investigating the resume failures, we confirmed (against the FC fork and the v6.1 guest balloon driver) that pausing the VM does not stop the balloon. pause_vm stops vCPUs only; FC's event loop keeps running (it serves the snapshot/pagemap API calls), so any free-page-hint descriptors still queued are processed after the pause. Each one is a MADV_REMOVE that punches holes in the memfd while the snapshot pipeline samples bitmaps and reads memory.

DrainBalloon made this worse: on drain timeout it logged and continued, leaving the hint round active with a backlog. There is also a protocol-level hazard: the guest's balloon shrinker may return hinted blocks to its allocator under the memory pressure the hint round itself creates, so a queued hint can be stale, and discarding it destroys live guest data along with its WP-async dirty evidence. This is demonstrated in tests on a real userfaultfd in the companion test PR. That part ultimately needs an FC-side fix (don't discard on hints, or reconcile against pagemap).

This is not 100% confirmed as the EN-978 root cause: the mechanism is dedup-flag-independent, while the failure rate appeared to track the dedup ramp. See the PR with repro tests for the open questions and decisive data cuts.

Fix

On any drain failure (timeout or describe error), call the fork's StopFreePageHinting. FC sets host_cmd=DONE synchronously on the same thread that processes the queue, after which remaining descriptors are acked without discards. The span outcome becomes timeout-stopped or timeout-stop-failed for observability.

Open question for review: if the stop itself fails we still continue the pause (existing behavior). Arguably the pause should abort then.

Pausing the VM only stops vCPUs; FC's event loop keeps processing
queued free-page-hint descriptors, each one a MADV_REMOVE mutating
guest memory while the snapshot is being taken. On drain timeout the
round was left active, so the backlog kept discarding after the pause.
Stop the round on any drain failure: FC applies the stop synchronously
on the queue-processing thread, after which remaining descriptors are
acked without discards.
@cla-bot cla-bot Bot added the cla-signed label Jun 13, 2026
@cursor

cursor Bot commented Jun 13, 2026

PR Summary

Medium Risk
Changes the pre-pause snapshot path and guest memfd integrity; stop failure still allows pause to proceed upstream, leaving a residual race if stop does not apply.

Overview
When free-page-hinting drain times out or cannot read status before pause/snapshot, the orchestrator now calls Firecracker StopFreePageHinting so an active hint round does not keep processing queued descriptors (which can discard guest memory while the VM is paused and snapshot work runs). A new FC API wrapper mirrors the existing start-hinting 204-as-success handling, and drain failures record span outcomes like timeout-stopped or timeout-stop-failed when stop succeeds or fails.

Reviewed by Cursor Bugbot for commit 40426a3.

@codecov

codecov Bot commented Jun 13, 2026

Codecov Report

❌ Patch coverage is 0% with 20 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| packages/orchestrator/pkg/sandbox/fc/client.go | 0.00% | 11 Missing ⚠️ |
| packages/orchestrator/pkg/sandbox/fc/process.go | 0.00% | 9 Missing ⚠️ |

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a stopBalloonHinting method to the Firecracker API client and invokes it during a failed balloon drain in Process.DrainBalloon to prevent guest memory corruption. I have no feedback to provide as no issues were identified.
