fix(worker): sample action peak memory by process group#2409

Merged
amankrx merged 1 commit into TraceMachina:main from amankrx:fix/action-resource-usage-sampler-reparenting on Jun 9, 2026

@amankrx amankrx commented Jun 9, 2026

Collaborator

Description

The action-resource-usage sampler walked the spawned child's process subtree (/proc/<pid>/task/*/children) to sum RSS. When an intermediate shell exits, the real workload is reparented to the worker (PID 1) and the spawned child is left as a childless zombie, so the walk reports 0. The worker then omits ActionResourceUsage (which is gated on peak_memory_kb > 0), no ResponseEvent.ActionResourceUsage origin event is published, and execution_tasks.observed_worker_peak_memory_mib stays NULL.
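The failure mode can be reproduced on a toy process table. The sketch below is illustrative only and is not the worker's code: `Proc`, `subtree_rss_kb`, the PIDs, and the RSS figures are all hypothetical. It models a parent-link walk from the spawned child and shows the sum dropping to 0 once the shell has exited and its workload has been reparented.

```rust
use std::collections::HashMap;

/// Simplified stand-in for one process entry (hypothetical type, for illustration).
struct Proc {
    ppid: u32,
    rss_kb: u64,
}

/// Sum RSS over `root` and all of its descendants by following parent links.
fn subtree_rss_kb(table: &HashMap<u32, Proc>, root: u32) -> u64 {
    let mut total = 0;
    let mut stack = vec![root];
    while let Some(pid) = stack.pop() {
        if let Some(p) = table.get(&pid) {
            total += p.rss_kb;
        }
        // Push every process that names `pid` as its parent.
        for (&child, p) in table {
            if p.ppid == pid && child != pid {
                stack.push(child);
            }
        }
    }
    total
}

/// Worker is PID 1; it spawned shell 100, which started workload 101.
fn before_shell_exit() -> HashMap<u32, Proc> {
    HashMap::from([
        (1, Proc { ppid: 0, rss_kb: 50_000 }),
        (100, Proc { ppid: 1, rss_kb: 1_000 }),
        (101, Proc { ppid: 100, rss_kb: 800_000 }),
    ])
}

/// The shell has exited: it is a zombie with no RSS, and 101 now has PID 1 as parent.
fn after_shell_exit() -> HashMap<u32, Proc> {
    HashMap::from([
        (1, Proc { ppid: 0, rss_kb: 50_000 }),
        (100, Proc { ppid: 1, rss_kb: 0 }),
        (101, Proc { ppid: 1, rss_kb: 800_000 }),
    ])
}
```

Walking from PID 100 covers the workload in the first snapshot but reaches only the zombie in the second, which is the case where the sampler reported 0.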

Instead, spawn the action as its own process-group leader and sum the RSS of every process whose pgrp == pgid. Process-group membership is inherited and survives reparenting, and matching on the action's pgid excludes the worker's own group.
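A minimal sketch of the process-group approach, using only the Rust standard library. This is not the code from this PR: the helper names are hypothetical, reading pgrp and rss from /proc/<pid>/stat is an assumption about the data source, and the 4 KiB page size is assumed rather than queried.

```rust
use std::fs;
use std::io;
use std::os::unix::process::CommandExt;
use std::process::{Child, Command};

/// Assumed page size in KiB; a real sampler should query sysconf(_SC_PAGESIZE).
const ASSUMED_PAGE_KB: u64 = 4;

/// Parse `(pgrp, rss_kb)` from the contents of /proc/<pid>/stat.
fn parse_pgrp_and_rss_kb(stat: &str) -> Option<(i32, u64)> {
    // comm (field 2) may contain spaces or parentheses, so split after the last ')'.
    let rest = &stat[stat.rfind(')')? + 1..];
    let fields: Vec<&str> = rest.split_whitespace().collect();
    // After comm: index 0 = state, 1 = ppid, 2 = pgrp, ..., 21 = rss (in pages).
    let pgrp = fields.get(2)?.parse().ok()?;
    let rss_pages: u64 = fields.get(21)?.parse().ok()?;
    Some((pgrp, rss_pages * ASSUMED_PAGE_KB))
}

/// Sum RSS across every /proc entry whose process group equals `pgid`.
fn group_rss_kb(pgid: i32) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir("/proc")? {
        let entry = entry?;
        // Only numeric directory names are processes.
        if entry.file_name().to_str().map_or(true, |n| n.parse::<u32>().is_err()) {
            continue;
        }
        // A process can exit between readdir and read; skip it rather than fail.
        let stat = match fs::read_to_string(entry.path().join("stat")) {
            Ok(s) => s,
            Err(_) => continue,
        };
        if let Some((pgrp, rss_kb)) = parse_pgrp_and_rss_kb(&stat) {
            if pgrp == pgid {
                total += rss_kb;
            }
        }
    }
    Ok(total)
}

/// Spawn `cmd` as the leader of a new process group (its pgid equals its pid).
fn spawn_as_group_leader(cmd: &mut Command) -> io::Result<Child> {
    cmd.process_group(0).spawn()
}
```

Under these assumptions, a sampler would call `spawn_as_group_leader`, take `child.id()` as the pgid, and poll `group_rss_kb` while tracking the maximum. Because the filter is on pgrp rather than on parent links, a workload reparented to PID 1 is still counted.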

Type of change

Please delete options that aren't relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Tested with a local on-prem test.

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend (see some docs)


vercel Bot commented Jun 9, 2026


Someone is attempting to deploy a commit to the native-link-web-assets Team on Vercel.

A member of the Team first needs to authorize it.


CLAassistant commented Jun 9, 2026


CLA assistant check
All committers have signed the CLA.

@amankrx amankrx force-pushed the fix/action-resource-usage-sampler-reparenting branch 2 times, most recently from 6095584 to b0fb22f Compare June 9, 2026 01:09

vercel Bot commented Jun 9, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| nativelink | Ready | Preview, Comment | Jun 9, 2026 2:33pm |
| nativelink-aidm | Ready | Preview, Comment | Jun 9, 2026 2:33pm |

Comment thread nativelink-worker/src/running_actions_manager.rs
Comment thread nativelink-worker/src/running_actions_manager.rs Outdated
The action-resource-usage sampler walked the spawned child's process
subtree (/proc/<pid>/task/*/children) to sum RSS. When an intermediate
shell exits and reparents the real workload to the worker (PID 1), the
spawned child is left a childless zombie, so the walk reports 0. The
worker then omits ActionResourceUsage (gated on peak_memory_kb > 0), no
ResponseEvent.ActionResourceUsage origin event is published, and
execution_tasks.observed_worker_peak_memory_mib stays NULL.

Spawn the action as its own process-group leader and sum RSS of every
process whose pgrp == pgid instead. Process-group membership is inherited
and survives reparenting, and excludes the worker's own group.
@amankrx amankrx merged commit 5a9259e into TraceMachina:main Jun 9, 2026
64 of 65 checks passed