feat(rlix): make MilesPipeline free-mem threshold configurable via MILES_MIN_FREE_GPU_MEM_GB by howard989 · Pull Request #11 · rlops/rlix

howard989 · 2026-05-10T21:16:56Z

What

v2 per @taoluo review: replace the hardcoded free-memory threshold in MilesPipeline._wait_for_overlap_engines_offloaded() with a runtime-configurable threshold on residual used GPU memory.

The env var is now:

MILES_MAX_RESIDUAL_GPU_MEM_GB

The check is now:

nvidia-smi memory.used
max_used_gb <= residual_target_gb

Addresses review-report finding R02-01 (plans/m11-review.review-report/R02.md).

Why

The first version made the old free-memory threshold configurable, but Tao pointed out that free memory is GPU-model dependent:

20 GB free on a 24 GB GPU and 20 GB free on an 80 or 96 GB GPU mean very different things.

The actual condition we need before wake_up is not "at least N GB free"; it is "the previous tenant has released enough GPU memory." That is better represented by residual used memory.

So this PR flips the semantic from:

MILES_MIN_FREE_GPU_MEM_GB
memory.free
min_free_gb >= target

to:

MILES_MAX_RESIDUAL_GPU_MEM_GB
memory.used
max_used_gb <= target
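To make the difference concrete, the sketch below evaluates both predicates on two hypothetical readings. The GPU capacities and usage figures are illustrative assumptions, not measurements from this PR, and the function names are invented for the example; only the 20 GB and 10 GB thresholds are taken from the PR description.

```python
def free_check(total_gb: float, used_gb: float, min_free_gb: float) -> bool:
    """Old semantic: pass when at least min_free_gb is free."""
    return (total_gb - used_gb) >= min_free_gb


def residual_check(used_gb: float, max_residual_gb: float) -> bool:
    """New semantic: pass when at most max_residual_gb is still in use."""
    return used_gb <= max_residual_gb


# Hypothetical readings: (label, total GB, used GB).
readings = [
    ("24 GB GPU, previous tenant offloaded", 24.0, 6.5),
    ("80 GB GPU, previous tenant still resident", 80.0, 50.0),
]

results = {
    label: (free_check(total, used, 20.0), residual_check(used, 10.0))
    for label, total, used in readings
}
for label, (free_ok, residual_ok) in results.items():
    print(f"{label}: free_check={free_ok} residual_check={residual_ok}")
```

On the small GPU the free-memory check fails even though the tenant has released its memory, and on the large GPU it passes even though the tenant has not. The residual check gives the intended answer in both cases.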

Coordinated MILES change

This PR is the receiver side. The matching sender change in MILES forwards the new env var from the driver shell into Ray runtime_env:

rlops/miles PR #3: howard/m11-forward-min-free-gpu-env

Both PRs must merge together for user overrides to propagate end-to-end.

Before

MilesPipeline._wait_for_overlap_engines_offloaded() used a free-memory threshold:

target_free_gb = 20.0

That threshold was GPU-capacity dependent and not portable across 24 GB, 80 GB, and 96 GB GPUs.

After

MilesPipeline._wait_for_overlap_engines_offloaded() now uses residual used memory:

target_residual_gb = parse_env_positive_float(
    "MILES_MAX_RESIDUAL_GPU_MEM_GB",
    10.0,
)

The wait condition is:

max_used_gb <= target_residual_gb

Default 10.0 is the current M11.2 smoke-safe default. This PR does not attempt to derive the residual threshold from model size yet; it changes the signal from GPU-capacity-dependent free memory to GPU-capacity-independent residual used memory, while keeping an env override for larger / different topologies.
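As a rough illustration of how a wait on this condition could be structured, here is a minimal polling sketch. It is not the code in MilesPipeline: the function name, the timeout and poll parameters, and the choice of TimeoutError are assumptions made for the example, and the probe is injected so the loop can be exercised without a GPU.

```python
import time
from typing import Callable, Sequence


def wait_for_residual(
    probe_used_gb: Callable[[], Sequence[float]],
    target_residual_gb: float,
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Poll until max(used) <= target; return that max, or raise on timeout."""
    deadline = clock() + timeout_s
    while True:
        max_used_gb = max(probe_used_gb())
        if max_used_gb <= target_residual_gb:
            return max_used_gb
        if clock() >= deadline:
            raise TimeoutError(
                f"residual used max={max_used_gb:.2f} GB "
                f"> target={target_residual_gb:.2f} GB"
            )
        sleep(poll_s)


# Usage with a fake probe that simulates the previous tenant offloading.
samples = iter([[42.0, 41.5], [18.3, 12.0], [6.5, 6.4]])
final_max = wait_for_residual(lambda: next(samples), 10.0, sleep=lambda _: None)
print(f"residual used max={final_max:.2f} GB")
```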

Invalid values fail fast:

non-numeric → RuntimeError
non-positive → RuntimeError
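A helper with this contract could look roughly like the sketch below. This is an illustration of the described behavior (default when unset, RuntimeError on non-numeric or non-positive input), not the actual code in rlix/utils/env.py; details such as the error messages and the handling of an empty string or NaN are assumptions.

```python
import os


def parse_env_positive_float(name: str, default: float) -> float:
    """Return the env var as a positive float, or the default when unset."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError as exc:
        raise RuntimeError(f"{name}={raw!r} is not a number") from exc
    if not value > 0:  # written this way so NaN is rejected as well
        raise RuntimeError(f"{name}={raw!r} must be positive")
    return value


# Usage: an explicit override replaces the default.
os.environ["MILES_MAX_RESIDUAL_GPU_MEM_GB"] = "30"
print(parse_env_positive_float("MILES_MAX_RESIDUAL_GPU_MEM_GB", 10.0))
```

Failing at parse time keeps a mistyped override from silently falling back to the default and changing wake-up timing without notice.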

Forwarding chain

The env var crosses multiple Ray actor boundaries:

User shell
  -> miles run_miles_*.py runtime_env whitelist
       (miles PR #3)
  -> MilesCoordinator runtime_env
  -> MilesCoordinator._build_pipeline_env_vars
       (this PR)
  -> MilesPipeline runtime_env
  -> MilesPipeline._wait_for_overlap_engines_offloaded
       (this PR)
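Each hop in this chain has to copy the variable explicitly, because a Ray actor started with its own runtime_env does not automatically see variables from the user's shell. The sketch below shows the general shape of one such hop under stated assumptions: the whitelist constant and the function name are hypothetical, and the real logic lives in the MILES driver scripts and MilesCoordinator._build_pipeline_env_vars. The only Ray-specific detail relied on is that runtime_env accepts an "env_vars" mapping of strings.

```python
import os
from typing import Mapping, Optional

# Hypothetical whitelist for illustration only.
FORWARDED_ENV_VARS = ("MILES_MAX_RESIDUAL_GPU_MEM_GB",)


def build_runtime_env(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Copy whitelisted variables that are set into a Ray-style runtime_env."""
    source = os.environ if environ is None else environ
    env_vars = {name: source[name] for name in FORWARDED_ENV_VARS if name in source}
    return {"env_vars": env_vars}


# Usage: only the whitelisted variable is forwarded.
runtime_env = build_runtime_env(
    {"MILES_MAX_RESIDUAL_GPU_MEM_GB": "30", "HOME": "/root"}
)
print(runtime_env)
```

Forwarding only variables that are actually set leaves the receiver's default in effect when the user gives no override.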

Changes

rlix/utils/env.py
- adds parse_env_positive_float
rlix/pipeline/miles_pipeline.py
- replaces free-memory threshold with residual-used threshold
- probes nvidia-smi --query-gpu=memory.used
- logs residual_target=...
rlix/pipeline/miles_coordinator.py
- forwards MILES_MAX_RESIDUAL_GPU_MEM_GB into per-pipeline runtime env
tests/test_env_utils.py
- covers default, override, non-positive, and non-numeric cases
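For readers unfamiliar with the probe, the following sketch shows one way to read memory.used from nvidia-smi and reduce it to a single maximum across selected GPUs. It is not the code in miles_pipeline.py: the function names are invented, adding the index column to the query is a choice made for this example, and converting MiB to GB by dividing by 1024 is an assumption about the intended unit.

```python
import subprocess
from typing import Dict, Iterable

MIB_PER_GB = 1024.0  # assumption: "GB" here means binary gigabytes (GiB)


def parse_used_gb(csv_text: str) -> Dict[int, float]:
    """Parse 'index, memory.used' CSV rows (MiB values, no header, no units)."""
    used = {}
    for line in csv_text.strip().splitlines():
        index, used_mib = (field.strip() for field in line.split(","))
        used[int(index)] = float(used_mib) / MIB_PER_GB
    return used


def query_used_gb() -> Dict[int, float]:
    """Run nvidia-smi and return used memory in GB keyed by GPU index."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_used_gb(out)


def max_used_gb(used_by_gpu: Dict[int, float], overlap_gpus: Iterable[int]) -> float:
    """Largest used-memory reading across the GPUs of interest."""
    return max(used_by_gpu[i] for i in overlap_gpus)


# Usage on canned output, so the example runs without a GPU.
parsed = parse_used_gb("0, 2048\n1, 512\n2, 12800\n")
print(f"residual used max={max_used_gb(parsed, [0, 2]):.2f} GB")
```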

Tests

Unit test:

python -m pytest -q tests/test_env_utils.py

Result:

4 passed in 0.01s

E2E Verification

Default smoke

Ran M11.2 dual-pipeline smoke on Vast using the current PR head:

rlix HEAD = f73bfafb06712b2501a658ca91caa609f0d0361a
default residual target = 10.0 GB

Key result:

shutdown_hard complete pipeline_id=miles_408112d83007
shutdown_hard complete pipeline_id=miles_bf79bf7d4e40
EXIT_CODE=0

This confirms the PR head passes the default M11.2 smoke.

Env override smoke

Ran M11.2 dual-pipeline smoke on Vast with the coordinated MILES PR and an explicit override:

MILES_MAX_RESIDUAL_GPU_MEM_GB=30 \
LOG=/root/logs/run_override_30.log \
SCRIPT=/root/rlix/scripts/run_smoke_dual.sh \
SILENCE_LIMIT=900 \
RUN_LIMIT=2400 \
bash /root/rlix/scripts/run_smoke_with_watchdog.sh

Key proof that the override reached per-pipeline MilesPipeline actors:

(MilesPipeline pid=189209) ... residual used max=6.51 GB across overlap GPUs [2] (residual_target=30.0 GB)
(MilesPipeline pid=187326) ... residual used max=6.51 GB across overlap GPUs [0] (residual_target=30.0 GB)
(MilesPipeline pid=189209) ... residual used max=12.50 GB across overlap GPUs [2] (residual_target=30.0 GB)
(MilesPipeline pid=187326) ... residual used max=12.28 GB across overlap GPUs [0] (residual_target=30.0 GB)

Smoke completion:

[2026-05-18 03:34:46] run_miles_dual.py:354 - [run_miles_dual] mp2 training loop complete pipeline_id=miles_c321e4f558cb
[2026-05-18 03:34:47] run_miles_dual.py:354 - [run_miles_dual] mp1 training loop complete pipeline_id=miles_98108b095b5c
[2026-05-18 03:34:47] run_miles_dual.py:374 - [run_miles_dual] shutdown_hard complete pipeline_id=miles_98108b095b5c
[2026-05-18 03:34:47] run_miles_dual.py:374 - [run_miles_dual] shutdown_hard complete pipeline_id=miles_c321e4f558cb
EXIT_CODE=0

Notes:

- Repeated RolloutManager 500 / RemoteProtocolError traces appear during shutdown as residual generate requests are cancelled while engines tear down. This is known shutdown noise from prior M11.2 smoke runs and does not affect EXIT_CODE=0.
- A temporary experiment with default 16.0 was discarded because it changed the timing and exposed a rollout async Collected: 0/8 hang. The PR keeps 10.0, which is the smoke-safe default verified above.

Scope

Configurability and signal-correctness only. This PR does not attempt model-size-derived threshold calculation; that would require a separate design using model size, runtime overhead, CUDA / NCCL residue, and topology information.

Refs: plans/m11-review.review-report/R02.md (R02-01, MEDIUM).

taoluo · 2026-05-17T15:57:23Z

+        # is the validated Qwen2.5-0.5B smoke setting; larger models can
+        # override it with MILES_MIN_FREE_GPU_MEM_GB without changing the
+        # driver CLI surface.
+        target_free_gb = parse_env_positive_float("MILES_MIN_FREE_GPU_MEM_GB", 20.0)


Free memory is GPU-model dependent, e.g. a 24 GB vs. an 80 GB GPU. Would it be more robust to check the residual memory allocation?

feat(rlix): make MilesPipeline free-mem threshold configurable via MI…

c194b11

…LES_MIN_FREE_GPU_MEM_GB

howard989 mentioned this pull request May 10, 2026

fix(miles): forward MILES_MIN_FREE_GPU_MEM_GB env to rlix MilesCoordinator runtime rlops/miles#3

Closed

taoluo reviewed May 17, 2026

howard989 added 6 commits May 17, 2026 14:11

fix(rlix): use residual GPU memory threshold for MilesPipeline wakeup

f73bfaf

merge: sync configurable residual GPU threshold with miles-mvp-e2e

3277387

fix(rlix): lower residual GPU memory threshold default

c7c8498

fix(rlix): gate rollout shrink on SGLang residual allocation

7a008db

chore(rlix): log SGLang residual allocation gate

3cfb3b1

fix(rlix): forward option beta env to MilesPipeline runtime

19e4090

howard989 closed this May 25, 2026

howard989 mentioned this pull request May 25, 2026

feat(miles): per-engine process-resident GPU residual gate + forward MILES_MAX_RESIDUAL_GPU_MEM_GB) #17

Open

howard989 wants to merge 7 commits into rlops:zhenyu/miles-mvp-e2e from howard989:howard/m11-configurable-free-gpu-threshold
