fix(uffd): stop assuming REMOVE events are page-size aligned#3002
fix(uffd): stop assuming REMOVE events are page-size aligned#3002ValentaTomas wants to merge 2 commits into
Conversation
The kernel only guarantees madvise (4 KiB) granularity for UFFD_EVENT_REMOVE ranges; on hugetlbfs the hole punch discards whole hugepages and zeroes partial edges in place while the event reports the full requested range. Rounding the event start down marked live hugepages Zero, which the pause maps to uuid.Nil - zeroing live guest memory on resume. Mark only fully covered blocks Zero and taint partially covered Dirty blocks so the pause re-reads their true content from the memfd. Adds kernel-semantics regression tests for the DONTNEED/REMOVE behavior on shared memfds.
PR SummaryMedium Risk Overview Reviewed by Cursor Bugbot for commit 3f2ea82. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
Code Review
This pull request introduces applyRemoveRange to handle unaligned UFFD_EVENT_REMOVE ranges by marking only fully covered blocks as Zero and tracking partially covered Dirty blocks as tainted to be re-read from the memfd during a pause. It also adds comprehensive tests demonstrating these semantics on shared memfds. No review comments were provided, and there is no additional feedback.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Investigation context (EN-978)
The serve loop's REMOVE batch assumed
rm.start/rm.endare aligned to the tracker page size ("UFFD invariant" comment). Empirically false: the kernel only guarantees madvise (4 KiB) granularity, and on hugetlbfs a misalignedMADV_REMOVEpunches only whole hugepages inward while zeroing the partial edges in place — yet the UFFD event reports the full requested range. The old arithmetic (BlockIdxrounds start down, end derived fromlen/pageSize) would then mark a fully live hugepageZero; the pause turns tracker-Zero intoEmpty->uuid.Nil, i.e. zeros over live guest memory on resume. The unmarked kernel-zeroed tail is the inverse hazard (stale parent bytes on resume).Today this is only fenced off by FC's
discard_rangealignment guard (and balloon hint/report chunks are buddy-aligned), so this is hardening against a kernel-verified hazard, not a confirmed production hit — see the includedTestHugetlbMemfd_MisalignedRemove_MarksLivePagesZero, which reproduces the kernel behavior and the old arithmetic's failure on a real hugetlb memfd + userfaultfd.Fix
applyRemoveRange: only blocks fully covered by the event are markedZero; partially coveredDirtyblocks keep their state and are recorded in aremoveTaintedbitmap (with a warn log — this should never fire with current FC).Uffd.DiffMetadatafolds the taint into the dirty set so the export re-reads those blocks' true bytes from the memfd instead of trusting tracker state.remove_shared_memfd_test.goadditionally documents the MAP_SHARED memfd REMOVE/DONTNEED semantics the tracker relies on (content destruction, no-event re-access, stale-hint write race), as kernel-verified regression tests; they self-skip without unprivileged userfaultfd / hugepages.Unit tests cover aligned, head/tail/both-edge misaligned, sub-block, and not-present cases. Full uffd suite passes with a real userfaultfd.