fix(orchestrator): stop balloon hinting round when pre-pause drain fails #3001
ValentaTomas wants to merge 1 commit into
Pausing the VM only stops vCPUs; FC's event loop keeps processing queued free-page-hint descriptors, each one a MADV_REMOVE mutating guest memory while the snapshot is being taken. On drain timeout the round was left active, so the backlog kept discarding after the pause. Stop the round on any drain failure: FC applies the stop synchronously on the queue-processing thread, after which remaining descriptors are acked without discards.
Code Review
This pull request introduces a stopBalloonHinting method to the Firecracker API client and invokes it during a failed balloon drain in Process.DrainBalloon to prevent guest memory corruption. I have no feedback to provide as no issues were identified.
Investigation context (EN-978)
While investigating the resume failures we confirmed (against the FC fork and the v6.1 guest balloon driver) that pausing the VM does not stop the balloon:
- pause_vm stops vCPUs only. FC's event loop keeps running (it serves the snapshot/pagemap API calls), and any free-page-hint descriptors still queued are processed after the pause, each one a MADV_REMOVE that punches holes in the memfd while the snapshot pipeline samples bitmaps and reads memory.
- DrainBalloon made this worse: on drain timeout it logged and continued, leaving the hint round active with a backlog.
There is also a protocol-level hazard: the guest's balloon shrinker may return hinted blocks to its allocator under the memory pressure the hint round itself creates, so a queued hint can be stale and its discard destroys live guest data plus its WP-async dirty evidence. This is demonstrated in tests on a real userfaultfd in the companion test PR. That part ultimately needs an FC-side fix (don't discard on hints, or reconcile against pagemap).
Not 100% confirmed as the EN-978 root cause: this mechanism is dedup-flag-independent, while the failure rate appeared to track the dedup ramp. See the PR with repro tests for the open questions and decisive data cuts.
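The post-pause behaviour described above can be illustrated with a toy model. This is a sketch only, not FC code: the `balloon` type, `processQueue`, `stopHinting`, and the one-bool-per-page memory are all hypothetical stand-ins chosen for illustration. It shows that a backlog of hints keeps discarding pages while the round is active, and that the same backlog is acked without discards once the round is stopped.

```go
package main

import "fmt"

// guestMemory is a toy stand-in for the memfd backing guest RAM:
// one bool per page, true meaning the page still holds data.
type guestMemory []bool

// balloon is a hypothetical model of the device side of free page hinting.
type balloon struct {
	mem     guestMemory
	queue   []int // queued hint descriptors, each naming one page index
	stopped bool  // set once the host has stopped the hinting round
}

// newBalloon builds a model with all pages populated and the given hints queued.
func newBalloon(pages int, hints ...int) *balloon {
	mem := make(guestMemory, pages)
	for i := range mem {
		mem[i] = true
	}
	return &balloon{mem: mem, queue: hints}
}

// stopHinting models the host-side stop: later descriptors are only acked.
func (b *balloon) stopHinting() { b.stopped = true }

// processQueue models the device event loop, which keeps running while the
// vCPUs are paused. It returns how many descriptors were acked and how many
// pages were discarded.
func (b *balloon) processQueue() (acked, discarded int) {
	for _, page := range b.queue {
		if !b.stopped && b.mem[page] {
			b.mem[page] = false // stands in for MADV_REMOVE on that page
			discarded++
		}
		acked++
	}
	b.queue = nil
	return acked, discarded
}

func main() {
	// Round left active: the backlog still discards after the "pause".
	active := newBalloon(8, 1, 3, 5)
	a, d := active.processQueue()
	fmt.Println("active round:  acked", a, "discarded", d)

	// Round stopped first: the same backlog is acked without discards.
	stopped := newBalloon(8, 1, 3, 5)
	stopped.stopHinting()
	a, d = stopped.processQueue()
	fmt.Println("stopped round: acked", a, "discarded", d)
}
```

The model deliberately ignores the stale-hint hazard; it only captures why leaving the round active with a backlog is unsafe during a snapshot.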
Fix
On any drain failure (timeout/describe error), call the fork's StopFreePageHinting: FC sets host_cmd=DONE synchronously on the same thread that processes the queue, after which remaining descriptors are acked without discards. Span outcome becomes timeout-stopped/timeout-stop-failed for observability.
Open question for review: if the stop itself fails we still continue the pause (existing behavior). Arguably the pause should abort then.
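The control flow of the fix might look roughly like the sketch below. It is not the code in this PR: the `hintingClient` interface, `HintingDone`, `waitDrained`, `fakeClient`, and `runScenario` are hypothetical names, and the `error-stopped`/`error-stop-failed` outcomes for a describe error are an assumption (only `timeout-stopped` and `timeout-stop-failed` are named above). The point it illustrates is that the stop is issued on any drain failure, using a context that has not already expired.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// hintingClient is a hypothetical view of the VMM API used by the drain.
type hintingClient interface {
	HintingDone(ctx context.Context) (bool, error) // has the hint round drained?
	StopFreePageHinting(ctx context.Context) error // end the round on the VMM side
}

// waitDrained polls until the round drains. It returns "" on success,
// "timeout" if ctx ended first, or "error" if the describe call failed.
func waitDrained(ctx context.Context, c hintingClient, poll time.Duration) string {
	ticker := time.NewTicker(poll)
	defer ticker.Stop()
	for {
		done, err := c.HintingDone(ctx)
		if err != nil {
			if ctx.Err() != nil {
				return "timeout"
			}
			return "error"
		}
		if done {
			return ""
		}
		select {
		case <-ctx.Done():
			return "timeout"
		case <-ticker.C:
		}
	}
}

// drainBalloon waits for the drain and stops the round on any failure.
// The returned string is the outcome that would be recorded on the span.
func drainBalloon(ctx context.Context, c hintingClient, timeout, poll time.Duration) string {
	drainCtx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	reason := waitDrained(drainCtx, c, poll)
	if reason == "" {
		return "drained"
	}
	// Use the parent ctx: drainCtx may already be past its deadline.
	if err := c.StopFreePageHinting(ctx); err != nil {
		return reason + "-stop-failed" // caller still continues the pause
	}
	return reason + "-stopped"
}

// fakeClient drains after doneAfter polls; a negative value means never.
type fakeClient struct {
	doneAfter, polls, stopCalls int
	stopErr                     error
}

func (f *fakeClient) HintingDone(_ context.Context) (bool, error) {
	f.polls++
	return f.doneAfter >= 0 && f.polls > f.doneAfter, nil
}

func (f *fakeClient) StopFreePageHinting(_ context.Context) error {
	f.stopCalls++
	return f.stopErr
}

// runScenario runs one drain against the fake and reports outcome and stop count.
func runScenario(doneAfter int, stopFails bool) (string, int) {
	c := &fakeClient{doneAfter: doneAfter}
	if stopFails {
		c.stopErr = errors.New("stop rejected")
	}
	outcome := drainBalloon(context.Background(), c, 50*time.Millisecond, time.Millisecond)
	return outcome, c.stopCalls
}

func main() {
	for _, s := range []struct {
		doneAfter int
		stopFails bool
	}{{2, false}, {-1, false}, {-1, true}} {
		outcome, stops := runScenario(s.doneAfter, s.stopFails)
		fmt.Printf("doneAfter=%d stopFails=%v -> %s (stop calls: %d)\n",
			s.doneAfter, s.stopFails, outcome, stops)
	}
}
```

In this sketch a failed stop only changes the reported outcome. Aborting the pause instead, as the open question suggests, would mean returning an error from `drainBalloon` in the `-stop-failed` branch.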