feat(orch): collapse envd's heap into 2 MiB hugepages before pause to cut cold-resume faults #2997
kalyazin wants to merge 3 commits into
Code Review
The fields of envd.CollapseResult (such as Collapsed, Skipped, Regions, and Chunks) are generated as pointers because they are optional in the OpenAPI schema. Accessing them directly in bestEffortCollapse will cause compile-time errors or potential nil-pointer dereference panics at runtime, so they should be safely dereferenced using a helper like utils.DerefOrDefault before use.
envdCollapseChunks.Add(ctx, int64(stats.Collapsed), metric.WithAttributes(attribute.String("result", "collapsed")))
envdCollapseChunks.Add(ctx, int64(stats.Skipped), metric.WithAttributes(attribute.String("result", "skipped")))
span.SetAttributes(
    attribute.Int("collapse.regions", stats.Regions),
    attribute.Int("collapse.chunks", stats.Chunks),
    attribute.Int("collapse.collapsed", stats.Collapsed),
    attribute.Int("collapse.skipped", stats.Skipped),
)

logger.L().Info(ctx, "envd heap collapsed",
    logger.WithSandboxID(s.Runtime.SandboxID),
    zap.Int("regions", stats.Regions),
    zap.Int("chunks", stats.Chunks),
    zap.Int("collapsed", stats.Collapsed),
    zap.Int("skipped", stats.Skipped),
    zap.Int64("duration_ms", elapsedMs),
)
The fields of envd.CollapseResult are generated as pointers because they are optional in the OpenAPI schema, which causes compile-time errors when converting them directly to integers or passing them to logging and tracing functions. Safely dereference these fields using utils.DerefOrDefault to prevent compilation failures and potential nil-pointer dereference panics.
collapsed := utils.DerefOrDefault(stats.Collapsed, 0)
skipped := utils.DerefOrDefault(stats.Skipped, 0)
regions := utils.DerefOrDefault(stats.Regions, 0)
chunks := utils.DerefOrDefault(stats.Chunks, 0)

envdCollapseChunks.Add(ctx, int64(collapsed), metric.WithAttributes(attribute.String("result", "collapsed")))
envdCollapseChunks.Add(ctx, int64(skipped), metric.WithAttributes(attribute.String("result", "skipped")))
span.SetAttributes(
    attribute.Int("collapse.regions", regions),
    attribute.Int("collapse.chunks", chunks),
    attribute.Int("collapse.collapsed", collapsed),
    attribute.Int("collapse.skipped", skipped),
)

logger.L().Info(ctx, "envd heap collapsed",
    logger.WithSandboxID(s.Runtime.SandboxID),
    zap.Int("regions", regions),
    zap.Int("chunks", chunks),
    zap.Int("collapsed", collapsed),
    zap.Int("skipped", skipped),
    zap.Int64("duration_ms", elapsedMs),
)
This is a false positive here. reclaim.go consumes the orchestrator's generated client (packages/orchestrator/pkg/sandbox/envd/envd.gen.go), whose cfg.yaml sets prefer-skip-optional-pointer: true — so CollapseResult fields are plain values (Chunks int, Collapsed int, ElapsedMs int64, …), not pointers. int64(stats.Collapsed) compiles, and utils.DerefOrDefault(stats.Collapsed, 0) would actually fail to compile (can't deref a non-pointer). Confirmed by the passing orchestrator build/lint/ARM64 checks. (The pointer-typed CollapseResult is the envd-side generated type, which is not used on this path.)
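To make the pointer-versus-value distinction in this reply concrete, here is a minimal sketch. `derefOrDefault`, `pointerResult`, and `valueResult` are hypothetical stand-ins written for illustration; they are not the repository's `utils.DerefOrDefault` or the generated `CollapseResult` types, whose exact definitions are not shown in this thread.

```go
package main

// derefOrDefault is a hypothetical generic helper: it returns *p, or def
// when p is nil. It only accepts a pointer as its first argument.
func derefOrDefault[T any](p *T, def T) T {
	if p == nil {
		return def
	}
	return *p
}

// pointerResult mimics a generated type with optional (pointer) fields.
type pointerResult struct{ Collapsed *int }

// valueResult mimics a generated type with plain value fields, as produced
// when optional pointers are skipped by the generator configuration.
type valueResult struct{ Collapsed int }

// collapsedCounts shows the conversion that each shape requires.
func collapsedCounts(p pointerResult, v valueResult) (fromPtr, fromVal int64) {
	fromPtr = int64(derefOrDefault(p.Collapsed, 0)) // pointer field: deref first
	fromVal = int64(v.Collapsed)                    // value field: convert directly
	// derefOrDefault(v.Collapsed, 0) would not compile: int is not *int.
	return fromPtr, fromVal
}
```

With value-typed fields the direct `int64(...)` conversion is the only form that compiles, which is consistent with the passing build checks mentioned above.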
        continue
    }

    regions = append(regions, region{start: uintptr(start), end: uintptr(end)})
Force-pushed from 28e5307 to 080d4c2.
envd's Go heap arenas are physically scattered across many distinct 2 MiB guest-physical frames. On resume each frame is a separate cold UFFD fault, and envd-init touches them serially, so envd's own footprint dominates resume latency.
Add a native POST /collapse endpoint that walks /proc/self/maps and madvise(MADV_COLLAPSE)es each anonymous read-write region chunk-by-chunk (2 MiB at a time, skipping empty chunks) after first marking it MADV_HUGEPAGE, consolidating the live pages into transparent hugepages so the post-pause snapshot faults far fewer frames on resume.
The collapse is best-effort: per-chunk failures are counted, not fatal. The endpoint returns 200 with the per-call stats (regions/chunks/collapsed/skipped and elapsed time) so the orchestrator can record them as metrics and span attributes; the same stats are logged in-guest.
Bump the envd version to 0.6.4 so the orchestrator can gate the call on endpoint availability.
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
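The region-selection step described in this commit message can be sketched as follows. This is an illustration only, not the PR's parser: `parseAnonRW` is a hypothetical name, it reads maps-formatted text from a string rather than from /proc/self/maps, and it assumes "anonymous" means inode 0 with no pathname (so `[heap]` and `[stack]` are excluded, which may differ from the real code).

```go
package main

import (
	"bufio"
	"strconv"
	"strings"
)

// region is a half-open address range [start, end).
type region struct{ start, end uintptr }

// parseAnonRW returns the anonymous read-write mappings found in
// /proc/<pid>/maps-formatted text. Each line has the form:
//
//	start-end perms offset dev inode [pathname]
//
// Assumption: a mapping is anonymous when it has no pathname field and
// its inode is 0.
func parseAnonRW(maps string) []region {
	var out []region
	sc := bufio.NewScanner(strings.NewReader(maps))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		// Five fields means no pathname; perms must start with "rw".
		if len(f) != 5 || !strings.HasPrefix(f[1], "rw") || f[4] != "0" {
			continue
		}
		lo, hi, ok := strings.Cut(f[0], "-")
		if !ok {
			continue
		}
		start, err1 := strconv.ParseUint(lo, 16, 64)
		end, err2 := strconv.ParseUint(hi, 16, 64)
		if err1 != nil || err2 != nil || end <= start {
			continue
		}
		out = append(out, region{start: uintptr(start), end: uintptr(end)})
	}
	return out
}
```

Each returned region would then be walked in 2 MiB windows; the madvise calls themselves are omitted here because they are Linux-specific.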
Mirror the user-cgroup-freeze pre-pause path: when the new collapse-envd-heap
LaunchDarkly flag is set, the orchestrator calls envd's POST /collapse just
before pause so envd consolidates its scattered heap into hugepages and faults
fewer distinct frames on resume. Gate on envd version (>= 0.6.4, which exposes
the endpoint) with a dedicated 10s timeout, and keep it best-effort so a failure
never blocks pause. The flag defaults off and is rolled out via LaunchDarkly.
Observe the collapse on a dedicated envd-collapse span and two metrics: the
orchestrator.sandbox.envd.collapse.duration histogram (round-trip ms, with a
success attribute) for the pause-path cost, and the
orchestrator.sandbox.envd.collapse.chunks counter split by result
(collapsed|skipped) so chunk attempts (total) and successes (collapsed) are both
queryable. The per-call stats are attached as span attributes and logged on
success ("envd heap collapsed") so efficacy is visible in Loki when the metric
counters reset on restart. Add a resume-build -collapse flag that overrides the
LD flag on for local testing.
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Unit-test the /proc/self/maps anonymous-rw parser, integration-test collapseRange against a sparsely-populated mmap'd region (one touched page per 2 MiB window), and smoke-test CollapseSelf on the test process's own heap.
Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Force-pushed from 25f42bd to d58d318.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
        jsonError(w, http.StatusInternalServerError, err)

        return
    }
Collapse ignores request cancellation
Medium Severity
PostCollapse runs memory.CollapseSelf() without honoring r.Context(). When the orchestrator’s 10s POST /collapse times out or the client disconnects, envd can keep issuing MADV_COLLAPSE while pause continues, unlike /freeze which uses the request context for its lock.


Why
A cold (GCS-backed) sandbox resume demand-faults guest RAM through userfaultfd at 2 MiB hugepage granularity, and each distinct 2 MiB frame is a separate serial fetch from remote storage (~tens of ms each). On the resume critical path the user cgroup is still frozen (freeze-user-cgroup), so the only pages faulted before envd answers /init are envd's own. envd is a long-lived Go process whose heap fragments over time across many distinct guest-physical 2 MiB frames at low fill, so restoring envd touches far more frames than its live byte size implies, and that dominates time-to-envd on a cold resume.
MADV_COLLAPSE migrates those scattered 4 KiB live pages into contiguous 2 MiB hugepages, collocating them into fewer distinct frames, so the post-pause snapshot faults fewer frames on the next resume. This PR adds an envd endpoint to do that and wires the orchestrator to call it just before pause, behind a flag.

What
envd: POST /collapse (packages/envd/, version bumped to 0.6.4)
- Walks /proc/self/maps, and for each anonymous read-write region marks it MADV_HUGEPAGE then issues MADV_COLLAPSE per 2 MiB chunk (skipping empty chunks).
- Returns 200 + JSON CollapseResult{regions,chunks,collapsed,skipped,elapsedMs} and logs the same stats in-guest.
- MADV_COLLAPSE over a whole region EINVALs on the first empty chunk; iterating 2 MiB windows and skipping empties is what makes it work on a sparse heap.
- MADV_HUGEPAGE first sets VM_HUGEPAGE so the collapse is eligible regardless of the system THP mode (no global transparent_hugepage=always needed).

orchestrator: flag-gated pre-pause collapse (packages/orchestrator/pkg/sandbox/)
- New LaunchDarkly flag collapse-envd-heap (default off). When set, bestEffortReclaim calls envd's /collapse just before pause, alongside the existing freeze step.
- Gated on envd version (>= 0.6.4, MinEnvdVersionForHeapCollapse) so older orchestrators/envds never call an endpoint that isn't there; dedicated 10 s timeout; best-effort so a failure never blocks pause.
- callEnvdCollapse decodes the CollapseResult (envd's client stub regenerated for the new 200+body shape).

Observability (packages/shared/pkg/telemetry, orchestrator)
- Histogram orchestrator.sandbox.envd.collapse.duration (round-trip ms, success attribute): the pause-path cost.
- Counter orchestrator.sandbox.envd.collapse.chunks split by result=collapsed|skipped: chunk attempts (total) vs successes (collapsed).
- envd-collapse span carrying collapse.{regions,chunks,collapsed,skipped,duration_ms,success}, plus a success log line ("envd heap collapsed") so efficacy is visible in Loki even when per-process metric counters reset on an orchestrator restart.

Validation
- Tests (internal/services/memory): /proc/self/maps anon-rw parser, collapseRange against a sparsely-populated mmap'd region (one touched page per 2 MiB window), and a CollapseSelf smoke test.
- go build/vet clean; golangci-lint 0 issues across the changed packages.
- Cold resume (use-nfs-for-snapshots=false to force GCS-cold): CTRL 1005 frames / 2010 MB / 22.6 s → COLL 392 frames / 784 MB / 9.5 s, i.e. −61 % faults, −61 % bytes, −58 % latency, with 385 chunks consolidated in 428 ms at pause. Metrics + span confirmed in Mimir/Tempo end-to-end (service_version 0.2.0-…).
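As a quick check of the arithmetic in the cold-resume figures, the sketch below recomputes the reductions from the reported before/after values. `reductionPct` and `framesToMB` are hypothetical helpers written for this note; the only inputs are the numbers quoted in the description, and the byte figures are assumed to be frames times 2 MiB as the reported totals suggest.

```go
package main

import "math"

// reductionPct returns the relative drop from before to after,
// rounded to the nearest whole percent.
func reductionPct(before, after float64) int {
	return int(math.Round((before - after) / before * 100))
}

// framesToMB converts a count of 2 MiB frames to the megabyte figure
// used in the description (frames x 2).
func framesToMB(frames int) int {
	return frames * 2
}
```

Applied to the reported numbers, `reductionPct(1005, 392)` and `reductionPct(2010, 784)` both give 61, `reductionPct(22.6, 9.5)` gives 58, and `framesToMB` maps 1005 and 392 frames to 2010 and 784, matching the description.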