feat(orch): collapse envd's heap into 2 MiB hugepages before pause to cut cold-resume faults #2997

Draft
kalyazin wants to merge 3 commits into main from kalyazin/envd-collapse-heap

Conversation

@kalyazin

Contributor

Why

A cold (GCS-backed) sandbox resume demand-faults guest RAM through userfaultfd at 2 MiB hugepage granularity, and each distinct 2 MiB frame is a separate serial fetch from remote storage (~tens of ms each). On the resume critical path the user cgroup is still frozen (freeze-user-cgroup), so the only pages faulted before envd answers /init are envd's own. envd is a long-lived Go process whose heap fragments over time across many distinct guest-physical 2 MiB frames at low fill — so restoring envd touches far more frames than its live byte size implies, and that dominates time-to-envd on a cold resume.

MADV_COLLAPSE migrates those scattered 4 KiB live pages into contiguous 2 MiB hugepages, collocating them into fewer distinct frames — so the post-pause snapshot faults fewer frames on the next resume. This PR adds an envd endpoint to do that and wires the orchestrator to call it just before pause, behind a flag.

What

envd: POST /collapse (packages/envd/, version bumped to 0.6.4)

  • Walks /proc/self/maps, and for each anonymous read-write region marks it MADV_HUGEPAGE then issues MADV_COLLAPSE per 2 MiB chunk (skipping empty chunks).
  • Returns 200 + JSON CollapseResult{regions,chunks,collapsed,skipped,elapsedMs} and logs the same stats in-guest.
  • Notes:
    • Chunk-wise, not whole-VMA: a single MADV_COLLAPSE over a whole region EINVALs on the first empty chunk; iterating 2 MiB windows and skipping empties is what makes it work on a sparse heap.
    • MADV_HUGEPAGE first sets VM_HUGEPAGE so the collapse is eligible regardless of the system THP mode (no global transparent_hugepage=always needed).
    • Best-effort: per-chunk failures are counted, never fatal.

orchestrator: flag-gated pre-pause collapse (packages/orchestrator/pkg/sandbox/)

  • New LaunchDarkly (LD) flag collapse-envd-heap (default off). When set, bestEffortReclaim calls envd's /collapse just before pause, alongside the existing freeze step.
  • Gated on envd version ≥ 0.6.4 (MinEnvdVersionForHeapCollapse) so the orchestrator never calls the endpoint on an older envd that lacks it; uses a dedicated 10 s timeout; best-effort, so a failure never blocks pause.
  • callEnvdCollapse decodes the CollapseResult (envd's client stub regenerated for the new 200+body shape).

Observability (packages/shared/pkg/telemetry, orchestrator)

  • Histogram orchestrator.sandbox.envd.collapse.duration (round-trip ms, success attribute) — the pause-path cost.
  • Counter orchestrator.sandbox.envd.collapse.chunks split by result=collapsed|skipped — chunk attempts (total) vs successes (collapsed).
  • Dedicated envd-collapse span carrying collapse.{regions,chunks,collapsed,skipped,duration_ms,success}, plus a success log line ("envd heap collapsed") so efficacy is visible in Loki even when per-process metric counters reset on an orchestrator restart.

Validation

  • Unit tests (internal/services/memory): /proc/self/maps anon-rw parser, collapseRange against a sparsely-populated mmap'd region (one touched page per 2 MiB window), and a CollapseSelf smoke test. go build/vet clean; golangci-lint 0 issues across the changed packages.
  • resume-build A/B (GCS-cold, envd heap physically scattered): cold-resume working set CTRL 1586 → COLL 465 distinct 2 MiB frames (−71 %), latency 32.2 s → 8.2 s (−74 %); collapse consolidated ~250 of ~290 chunks.
  • Full managed-orchestrator A/B (e2b API → orchestrator → Firecracker, use-nfs-for-snapshots=false to force GCS-cold): CTRL 1005 frames / 2010 MB / 22.6 s → COLL 392 frames / 784 MB / 9.5 s, i.e. −61 % faults, −61 % bytes, −58 % latency, with 385 chunks consolidated in 428 ms at pause. Metrics + span confirmed in Mimir/Tempo end-to-end (service_version 0.2.0-…).
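
As a quick arithmetic check of the managed A/B figures above, the following sketch recomputes the byte totals from the frame counts (2 MiB per frame) and the rounded percentage reductions. The helper names are hypothetical and exist only for this check.

```go
package main

import "math"

// frameMiB is the size of one faulted frame in MiB (2 MiB hugepages).
const frameMiB = 2

// reductionPct returns the drop from before to after as a percentage,
// rounded to the nearest integer.
func reductionPct(before, after float64) int {
	return int(math.Round((before - after) / before * 100))
}

// framesToMiB converts a count of distinct 2 MiB frames into MiB fetched.
func framesToMiB(frames int) int { return frames * frameMiB }
```

With these helpers, framesToMiB(1005) and framesToMiB(392) give 2010 and 784, matching the reported byte totals, while reductionPct(1005, 392) and reductionPct(22.6, 9.5) give 61 and 58.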

@cla-bot cla-bot Bot added the cla-signed label Jun 12, 2026
@cursor

cursor Bot commented Jun 12, 2026


PR Summary

Medium Risk
Touches the pause/snapshot path and envd's live heap via madvise; rollout is flag- and version-gated and best-effort, but enabled runs add pause latency and alter captured memory layout.

Overview
Adds a best-effort pre-pause step so envd compacts its own scattered Go heap into 2 MiB transparent hugepages, reducing distinct guest frames that cold resume must fault from remote storage. envd 0.6.4 exposes POST /collapse, which walks anonymous read-write mappings and applies per-2 MiB MADV_HUGEPAGE / MADV_COLLAPSE, returning CollapseResult stats. The orchestrator calls it from the existing pre-pause reclaim path when LaunchDarkly collapse-envd-heap is on (default off), gated on envd ≥ 0.6.4, with a 10s timeout and failures never blocking pause; metrics and spans record collapse duration and chunk outcomes. resume-build gains a -collapse flag to force the feature in dev benches.

Reviewed by Cursor Bugbot for commit d58d318. Bugbot is set up for automated code reviews on this repo. Configure here.

@codecov

codecov Bot commented Jun 12, 2026


Codecov Report

❌ Patch coverage is 27.72277% with 146 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines | Patch % | Lines
packages/orchestrator/pkg/sandbox/reclaim.go | 0.00% | 58 Missing ⚠️
packages/envd/internal/api/collapse.go | 0.00% | 36 Missing ⚠️
packages/orchestrator/pkg/sandbox/envd.go | 0.00% | 34 Missing ⚠️
...es/envd/internal/services/memory/collapse_linux.go | 80.00% | 8 Missing and 6 partials ⚠️
packages/orchestrator/cmd/resume-build/main.go | 0.00% | 4 Missing ⚠️


@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

The fields of envd.CollapseResult (such as Collapsed, Skipped, Regions, and Chunks) are generated as pointers because they are optional in the OpenAPI schema. Accessing them directly in bestEffortCollapse will cause compile-time errors or potential nil-pointer dereference panics at runtime, so they should be safely dereferenced using a helper like utils.DerefOrDefault before use.


Comment on lines +190 to +206
	envdCollapseChunks.Add(ctx, int64(stats.Collapsed), metric.WithAttributes(attribute.String("result", "collapsed")))
	envdCollapseChunks.Add(ctx, int64(stats.Skipped), metric.WithAttributes(attribute.String("result", "skipped")))
	span.SetAttributes(
		attribute.Int("collapse.regions", stats.Regions),
		attribute.Int("collapse.chunks", stats.Chunks),
		attribute.Int("collapse.collapsed", stats.Collapsed),
		attribute.Int("collapse.skipped", stats.Skipped),
	)

	logger.L().Info(ctx, "envd heap collapsed",
		logger.WithSandboxID(s.Runtime.SandboxID),
		zap.Int("regions", stats.Regions),
		zap.Int("chunks", stats.Chunks),
		zap.Int("collapsed", stats.Collapsed),
		zap.Int("skipped", stats.Skipped),
		zap.Int64("duration_ms", elapsedMs),
	)

Contributor


critical

The fields of envd.CollapseResult are generated as pointers because they are optional in the OpenAPI schema, which causes compile-time errors when converting them directly to integers or passing them to logging and tracing functions. Safely dereference these fields using utils.DerefOrDefault to prevent compilation failures and potential nil-pointer dereference panics.

	collapsed := utils.DerefOrDefault(stats.Collapsed, 0)
	skipped := utils.DerefOrDefault(stats.Skipped, 0)
	regions := utils.DerefOrDefault(stats.Regions, 0)
	chunks := utils.DerefOrDefault(stats.Chunks, 0)

	envdCollapseChunks.Add(ctx, int64(collapsed), metric.WithAttributes(attribute.String("result", "collapsed")))
	envdCollapseChunks.Add(ctx, int64(skipped), metric.WithAttributes(attribute.String("result", "skipped")))
	span.SetAttributes(
		attribute.Int("collapse.regions", regions),
		attribute.Int("collapse.chunks", chunks),
		attribute.Int("collapse.collapsed", collapsed),
		attribute.Int("collapse.skipped", skipped),
	)

	logger.L().Info(ctx, "envd heap collapsed",
		logger.WithSandboxID(s.Runtime.SandboxID),
		zap.Int("regions", regions),
		zap.Int("chunks", chunks),
		zap.Int("collapsed", collapsed),
		zap.Int("skipped", skipped),
		zap.Int64("duration_ms", elapsedMs),
	)

Contributor Author


This is a false positive here. reclaim.go consumes the orchestrator's generated client (packages/orchestrator/pkg/sandbox/envd/envd.gen.go), whose cfg.yaml sets prefer-skip-optional-pointer: true — so CollapseResult fields are plain values (Chunks int, Collapsed int, ElapsedMs int64, …), not pointers. int64(stats.Collapsed) compiles, and utils.DerefOrDefault(stats.Collapsed, 0) would actually fail to compile (can't deref a non-pointer). Confirmed by the passing orchestrator build/lint/ARM64 checks. (The pointer-typed CollapseResult is the envd-side generated type, which is not used on this path.)
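
The distinction in the reply above can be illustrated with a small sketch. The two struct types are hypothetical stand-ins for the two generated shapes (pointer fields versus plain values under prefer-skip-optional-pointer: true), and derefOrDefault is an assumed equivalent of utils.DerefOrDefault, whose real signature is not shown in this thread.

```go
package main

// collapseResultPtr stands in for a generated type whose optional fields
// are pointers (hypothetical, reduced to one field).
type collapseResultPtr struct{ Collapsed *int }

// collapseResultVal stands in for the same type generated with
// prefer-skip-optional-pointer: true, where fields are plain values.
type collapseResultVal struct{ Collapsed int }

// derefOrDefault returns *p, or def when p is nil. Assumed to behave like
// utils.DerefOrDefault; it only accepts pointer arguments.
func derefOrDefault[T any](p *T, def T) T {
	if p == nil {
		return def
	}
	return *p
}

// collapsedFromPtr needs the nil-safe dereference before converting.
func collapsedFromPtr(r collapseResultPtr) int64 {
	return int64(derefOrDefault(r.Collapsed, 0))
}

// collapsedFromVal converts directly. Passing r.Collapsed (an int) to
// derefOrDefault here would not compile, since it is not a *int.
func collapsedFromVal(r collapseResultVal) int64 {
	return int64(r.Collapsed)
}
```

Which of the two accessors applies depends only on which generated package the caller imports, which is why the same suggestion can be correct for one client and a compile error for the other.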

continue
}

regions = append(regions, region{start: uintptr(start), end: uintptr(end)})
@kalyazin kalyazin force-pushed the kalyazin/envd-collapse-heap branch from 28e5307 to 080d4c2 Compare June 12, 2026 20:45
kalyazin and others added 3 commits June 12, 2026 21:47
envd's Go heap arenas are physically scattered across many distinct 2 MiB
guest-physical frames. On resume each frame is a separate cold UFFD fault, and
envd-init touches them serially, so envd's own footprint dominates resume
latency. Add a native POST /collapse endpoint that walks /proc/self/maps and
madvise(MADV_COLLAPSE)es each anonymous read-write region chunk-by-chunk
(2 MiB at a time, skipping empty chunks) after first marking it MADV_HUGEPAGE,
consolidating the live pages into transparent hugepages so the post-pause
snapshot faults far fewer frames on resume.

The collapse is best-effort: per-chunk failures are counted, not fatal. The
endpoint returns 200 with the per-call stats (regions/chunks/collapsed/skipped
and elapsed time) so the orchestrator can record them as metrics and span
attributes; the same stats are logged in-guest. Bump the envd version to 0.6.4
so the orchestrator can gate the call on endpoint availability.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Mirror the user-cgroup-freeze pre-pause path: when the new collapse-envd-heap
LaunchDarkly flag is set, the orchestrator calls envd's POST /collapse just
before pause so envd consolidates its scattered heap into hugepages and faults
fewer distinct frames on resume. Gate on envd version (>= 0.6.4, which exposes
the endpoint) with a dedicated 10s timeout, and keep it best-effort so a failure
never blocks pause. The flag defaults off and is rolled out via LaunchDarkly.

Observe the collapse on a dedicated envd-collapse span and two metrics: the
orchestrator.sandbox.envd.collapse.duration histogram (round-trip ms, with a
success attribute) for the pause-path cost, and the
orchestrator.sandbox.envd.collapse.chunks counter split by result
(collapsed|skipped) so chunk attempts (total) and successes (collapsed) are both
queryable. The per-call stats are attached as span attributes and logged on
success ("envd heap collapsed") so efficacy is visible in Loki when the metric
counters reset on restart. Add a resume-build -collapse flag that overrides the
LD flag on for local testing.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Unit-test the /proc/self/maps anonymous-rw parser, integration-test collapseRange
against a sparsely-populated mmap'd region (one touched page per 2 MiB window),
and smoke-test CollapseSelf on the test process's own heap.

Signed-off-by: Nikita Kalyazin <nikita.kalyazin@e2b.dev>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@kalyazin kalyazin force-pushed the kalyazin/envd-collapse-heap branch from 25f42bd to d58d318 Compare June 12, 2026 20:47

@cursor cursor Bot left a comment



Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit d58d318. Configure here.

jsonError(w, http.StatusInternalServerError, err)

return
}



Collapse ignores request cancellation

Medium Severity

PostCollapse runs memory.CollapseSelf() without honoring r.Context(). When the orchestrator’s 10s POST /collapse times out or the client disconnects, envd can keep issuing MADV_COLLAPSE while pause continues, unlike /freeze which uses the request context for its lock.

