feat(rlix): make MilesPipeline free-mem threshold configurable via MILES_MIN_FREE_GPU_MEM_GB #11

Closed
howard989 wants to merge 7 commits into rlops:zhenyu/miles-mvp-e2e from howard989:howard/m11-configurable-free-gpu-threshold

Conversation

@howard989 howard989 commented May 10, 2026

What

v2 per @taoluo review: replace the hardcoded free-memory threshold in MilesPipeline._wait_for_overlap_engines_offloaded() with a runtime-configurable residual used GPU memory threshold.

The env var is now:

MILES_MAX_RESIDUAL_GPU_MEM_GB

The check is now:

nvidia-smi memory.used
max_used_gb <= residual_target_gb

Addresses review-report finding R02-01 (plans/m11-review.review-report/R02.md).

Why

The first version made the old free-memory threshold configurable, but Tao pointed out that free memory is GPU-model dependent:

20 GB free means very different things on a 24 GB GPU and on an 80 / 96 GB GPU.

The actual condition we need before wake_up is not "at least N GB free"; it is "the previous tenant has released enough GPU memory." That is better represented by residual used memory.
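To make the capacity dependence concrete, here is a small illustrative calculation. It is not code from this PR; `implied_max_used_gb` is a hypothetical helper, and it uses nominal GPU capacity, ignoring any memory the driver reserves.

```python
# Illustrative arithmetic only (not part of the PR): what a fixed
# free-memory threshold implies about allowed residual usage per GPU size.
def implied_max_used_gb(capacity_gb: float, min_free_gb: float) -> float:
    """Largest used memory that still satisfies free >= min_free_gb."""
    return max(capacity_gb - min_free_gb, 0.0)


for capacity_gb in (24.0, 80.0, 96.0):
    allowed = implied_max_used_gb(capacity_gb, 20.0)
    print(f"{capacity_gb:.0f} GB GPU: up to {allowed:.0f} GB may remain in use")
```

With the old 20 GB threshold, a 24 GB card tolerates at most 4 GB left behind by the previous tenant, while an 80 GB or 96 GB card tolerates 60 GB or 76 GB. A residual used-memory threshold asks the same question on every card.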

So this PR flips the semantics from:

MILES_MIN_FREE_GPU_MEM_GB
memory.free
min_free_gb >= target

to:

MILES_MAX_RESIDUAL_GPU_MEM_GB
memory.used
max_used_gb <= target

Coordinated MILES change

This PR is the receiver side. The matching sender change in MILES forwards the new env var from the driver shell into Ray runtime_env:

rlops/miles PR #3: howard/m11-forward-min-free-gpu-env

Both PRs must merge together for user overrides to propagate end-to-end.

Before

MilesPipeline._wait_for_overlap_engines_offloaded() used a free-memory threshold:

target_free_gb = 20.0

That threshold was GPU-capacity dependent and not portable across 24 GB, 80 GB, and 96 GB GPUs.

After

MilesPipeline._wait_for_overlap_engines_offloaded() now uses residual used memory:

target_residual_gb = parse_env_positive_float(
    "MILES_MAX_RESIDUAL_GPU_MEM_GB",
    10.0,
)

The wait condition is:

max_used_gb <= target_residual_gb
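As a rough sketch of how such a check could be evaluated (not the actual `MilesPipeline` code), the snippet below parses `nvidia-smi --query-gpu=memory.used` output and compares the maximum across the overlap GPUs against the target. The helper names `parse_used_mib`, `query_used_mib`, and `residual_ok` are hypothetical, and the MiB-to-GB conversion by dividing by 1024 is an assumption about how the PR reports "GB".

```python
import subprocess

MIB_PER_GB = 1024.0  # assumption: "GB" in the logs means MiB / 1024


def parse_used_mib(text: str) -> list[float]:
    """Parse one memory.used value (MiB) per non-empty line."""
    return [float(line.strip()) for line in text.splitlines() if line.strip()]


def query_used_mib() -> list[float]:
    """Ask nvidia-smi for per-GPU used memory, in GPU index order."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_used_mib(out)


def residual_ok(
    used_mib: list[float], overlap_gpus: list[int], target_residual_gb: float
) -> tuple[bool, float]:
    """Return (condition met, max used GB) across the given GPU indices."""
    max_used_gb = max(used_mib[i] for i in overlap_gpus) / MIB_PER_GB
    return max_used_gb <= target_residual_gb, max_used_gb


# Example with canned output instead of a live nvidia-smi call.
sample = parse_used_mib("6666\n512\n12800\n")
print(residual_ok(sample, [0], 10.0))
print(residual_ok(sample, [0, 2], 10.0))
```

Taking the maximum over the overlap GPUs means a single GPU that has not been released yet is enough to keep the wait going.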

Default 10.0 is the current M11.2 smoke-safe default. This PR does not attempt to derive the residual threshold from model size yet; it changes the signal from GPU-capacity-dependent free memory to GPU-capacity-independent residual used memory, while keeping an env override for larger / different topologies.

Invalid values fail fast:

  • non-numeric → RuntimeError
  • non-positive → RuntimeError
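A minimal sketch of a helper with this behavior follows. It is an approximation, not the code in `rlix/utils/env.py`: only the name, the `(name, default)` call shape, and the `RuntimeError` cases come from the description above. The error messages and the rejection of NaN are assumptions.

```python
import os


def parse_env_positive_float(name: str, default: float) -> float:
    """Read a positive float from the environment, or return the default."""
    raw = os.environ.get(name)
    if raw is None:
        # Variable not set: fall back to the caller's default.
        return default
    try:
        value = float(raw)
    except ValueError as exc:
        raise RuntimeError(f"{name}={raw!r} is not a number") from exc
    # "not (value > 0)" also rejects NaN, which "value <= 0" would let through.
    if not (value > 0):
        raise RuntimeError(f"{name}={raw!r} must be a positive number")
    return value


os.environ["MILES_MAX_RESIDUAL_GPU_MEM_GB"] = "30"
print(parse_env_positive_float("MILES_MAX_RESIDUAL_GPU_MEM_GB", 10.0))
```

Raising instead of silently falling back to the default means a mistyped override surfaces at startup rather than as an unexplained wait later in the run.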

Forwarding chain

The env var crosses multiple Ray actor boundaries:

User shell
  -> miles run_miles_*.py runtime_env whitelist
       (miles PR #3)
  -> MilesCoordinator runtime_env
  -> MilesCoordinator._build_pipeline_env_vars
       (this PR)
  -> MilesPipeline runtime_env
  -> MilesPipeline._wait_for_overlap_engines_offloaded
       (this PR)
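Each hop in this chain has to copy the variable explicitly, because a Ray actor only sees the environment variables that its `runtime_env` carries. The sketch below shows one way a hop could do that. The `FORWARDED_ENV_VARS` tuple and `build_forwarded_env_vars` are hypothetical names, not the actual `_build_pipeline_env_vars` implementation; only the `{"env_vars": {...}}` shape of a Ray `runtime_env` is taken from Ray itself.

```python
import os
from typing import Mapping, Optional

# Hypothetical whitelist of variables to pass through to child actors.
FORWARDED_ENV_VARS = ("MILES_MAX_RESIDUAL_GPU_MEM_GB",)


def build_forwarded_env_vars(
    environ: Optional[Mapping[str, str]] = None,
) -> dict[str, str]:
    """Copy whitelisted variables that are set; skip the ones that are not."""
    source = os.environ if environ is None else environ
    return {name: source[name] for name in FORWARDED_ENV_VARS if name in source}


# Example: only the whitelisted variable ends up in the child runtime_env.
parent_env = {"MILES_MAX_RESIDUAL_GPU_MEM_GB": "30", "HOME": "/root"}
runtime_env = {"env_vars": build_forwarded_env_vars(parent_env)}
print(runtime_env)
```

Leaving unset variables out of the dict, rather than forwarding an empty string, lets the receiving side fall back to its own default.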

Changes

  • rlix/utils/env.py
    • adds parse_env_positive_float
  • rlix/pipeline/miles_pipeline.py
    • replaces free-memory threshold with residual-used threshold
    • probes nvidia-smi --query-gpu=memory.used
    • logs residual_target=...
  • rlix/pipeline/miles_coordinator.py
    • forwards MILES_MAX_RESIDUAL_GPU_MEM_GB into per-pipeline runtime env
  • tests/test_env_utils.py
    • covers default, override, non-positive, and non-numeric cases

Tests

Unit test:

python -m pytest -q tests/test_env_utils.py

Result:

4 passed in 0.01s

E2E Verification

Default smoke

Ran M11.2 dual-pipeline smoke on Vast using the current PR head:

rlix HEAD = f73bfafb06712b2501a658ca91caa609f0d0361a
default residual target = 10.0 GB

Key result:

shutdown_hard complete pipeline_id=miles_408112d83007
shutdown_hard complete pipeline_id=miles_bf79bf7d4e40
EXIT_CODE=0

This confirms the PR head passes the default M11.2 smoke.

Env override smoke

Ran M11.2 dual-pipeline smoke on Vast with the coordinated MILES PR and an explicit override:

MILES_MAX_RESIDUAL_GPU_MEM_GB=30 \
LOG=/root/logs/run_override_30.log \
SCRIPT=/root/rlix/scripts/run_smoke_dual.sh \
SILENCE_LIMIT=900 \
RUN_LIMIT=2400 \
bash /root/rlix/scripts/run_smoke_with_watchdog.sh

Key proof that the override reached per-pipeline MilesPipeline actors:

(MilesPipeline pid=189209) ... residual used max=6.51 GB across overlap GPUs [2] (residual_target=30.0 GB)
(MilesPipeline pid=187326) ... residual used max=6.51 GB across overlap GPUs [0] (residual_target=30.0 GB)
(MilesPipeline pid=189209) ... residual used max=12.50 GB across overlap GPUs [2] (residual_target=30.0 GB)
(MilesPipeline pid=187326) ... residual used max=12.28 GB across overlap GPUs [0] (residual_target=30.0 GB)

Smoke completion:

[2026-05-18 03:34:46] run_miles_dual.py:354 - [run_miles_dual] mp2 training loop complete pipeline_id=miles_c321e4f558cb
[2026-05-18 03:34:47] run_miles_dual.py:354 - [run_miles_dual] mp1 training loop complete pipeline_id=miles_98108b095b5c
[2026-05-18 03:34:47] run_miles_dual.py:374 - [run_miles_dual] shutdown_hard complete pipeline_id=miles_98108b095b5c
[2026-05-18 03:34:47] run_miles_dual.py:374 - [run_miles_dual] shutdown_hard complete pipeline_id=miles_c321e4f558cb
EXIT_CODE=0

Notes:

  • Repeated RolloutManager 500 / RemoteProtocolError traces appear during shutdown as residual generate requests are cancelled while engines tear down. This is known shutdown noise from prior M11.2 smoke runs and does not affect EXIT_CODE=0.
  • A temporary experiment with default 16.0 was discarded because it changed the timing and exposed a rollout async Collected: 0/8 hang. The PR keeps 10.0, which is the smoke-safe default verified above.

Scope

Configurability and signal-correctness only. This PR does not attempt model-size-derived threshold calculation; that would require a separate design using model size, runtime overhead, CUDA / NCCL residue, and topology information.

Refs: plans/m11-review.review-report/R02.md (R02-01, MEDIUM).

Comment thread rlix/pipeline/miles_pipeline.py Outdated
# is the validated Qwen2.5-0.5B smoke setting; larger models can
# override it with MILES_MIN_FREE_GPU_MEM_GB without changing the
# driver CLI surface.
target_free_gb = parse_env_positive_float("MILES_MIN_FREE_GPU_MEM_GB", 20.0)

Free memory is GPU-model dependent, e.g. a 24 GB vs. an 80 GB GPU. Would it be more robust to check the residual memory allocation?
