feat(rlix): make MilesPipeline free-mem threshold configurable via MILES_MIN_FREE_GPU_MEM_GB#11
Closed
howard989 wants to merge 7 commits into
…LES_MIN_FREE_GPU_MEM_GB
taoluo reviewed May 17, 2026
# is the validated Qwen2.5-0.5B smoke setting; larger models can
# override it with MILES_MIN_FREE_GPU_MEM_GB without changing the
# driver CLI surface.
target_free_gb = parse_env_positive_float("MILES_MIN_FREE_GPU_MEM_GB", 20.0)
Contributor
Free memory is GPU-model dependent (e.g. a 24 GB vs. an 80 GB GPU). Would it be more robust to check the residual memory allocation instead?
What
v2 per @taoluo review: replace the hardcoded free-memory threshold in MilesPipeline._wait_for_overlap_engines_offloaded() with a runtime-configurable threshold on residual used GPU memory.

The env var is now MILES_MAX_RESIDUAL_GPU_MEM_GB (default 10.0).

The check is now:

Addresses review-report finding R02-01 (plans/m11-review.review-report/R02.md).

Why
The first version made the old free-memory threshold configurable, but Tao pointed out that free memory is GPU-model dependent:
The actual condition we need before wake_up is not "at least N GB free"; it is "the previous tenant has released enough GPU memory." That is better represented by residual used memory.

So this PR flips the semantics from:
to:
Coordinated MILES change
This PR is the receiver side. The matching sender change in MILES forwards the new env var from the driver shell into Ray runtime_env:

Both PRs must merge together for user overrides to propagate end-to-end.
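As a rough illustration of the sender side, the sketch below builds a Ray runtime_env dictionary that carries the override only when the user has set it. The helper name build_runtime_env and its environ parameter are hypothetical and are not taken from either PR; the only Ray detail relied on is that runtime_env accepts an "env_vars" mapping of strings.

```python
import os

# Env var name as listed in the Changes section of this PR.
ENV_VAR = "MILES_MAX_RESIDUAL_GPU_MEM_GB"


def build_runtime_env(environ=None):
    """Return a Ray-style runtime_env dict that forwards the override.

    Hypothetical helper: `environ` defaults to os.environ and exists only
    so the function can be exercised without touching the real process env.
    """
    environ = os.environ if environ is None else environ
    env_vars = {}
    value = environ.get(ENV_VAR)
    if value is not None:
        # Ray expects env_vars values to be strings, so forward the raw
        # text and let the receiver parse and validate it.
        env_vars[ENV_VAR] = value
    return {"env_vars": env_vars}


if __name__ == "__main__":
    print(build_runtime_env({ENV_VAR: "16.0"}))
    print(build_runtime_env({}))
```

Forwarding the raw string keeps validation in one place on the receiver side, so an unset variable still falls back to the receiver's default.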
Before
MilesPipeline._wait_for_overlap_engines_offloaded() used a free-memory threshold:

That threshold was GPU-capacity dependent and not portable across 24 GB, 80 GB, and 96 GB GPUs.
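To make the portability problem concrete, here is a small arithmetic sketch comparing the two signals. The function names and the 6 GB residue figure are illustrative assumptions; 20.0 is the free-memory default from the reviewed diff and 10.0 is the residual default described in this PR.

```python
def free_threshold_met(total_gb, used_gb, min_free_gb):
    """Old-style signal: pass when at least min_free_gb is free."""
    return (total_gb - used_gb) >= min_free_gb


def residual_threshold_met(used_gb, max_residual_gb):
    """New-style signal: pass when used memory is at most max_residual_gb."""
    return used_gb <= max_residual_gb


if __name__ == "__main__":
    residue_gb = 6.0  # assumed leftover usage after the previous tenant offloads
    for total_gb in (24.0, 80.0, 96.0):
        print(
            total_gb,
            free_threshold_met(total_gb, residue_gb, 20.0),
            residual_threshold_met(residue_gb, 10.0),
        )
```

With the same assumed residue, the free-memory check can never pass on the 24 GB card (only 18 GB is free) while it passes on the larger cards; the residual check gives the same answer on all three.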
After
MilesPipeline._wait_for_overlap_engines_offloaded() now uses residual used memory:

The wait condition is:
The default, 10.0, is the current M11.2 smoke-safe value. This PR does not attempt to derive the residual threshold from model size yet; it changes the signal from GPU-capacity-dependent free memory to GPU-capacity-independent residual used memory, while keeping an env override for larger or different topologies.

Invalid values are fail-fast: they raise RuntimeError.

Forwarding chain
The env var crosses multiple Ray actor boundaries:
Changes
rlix/utils/env.py: parse_env_positive_float
rlix/pipeline/miles_pipeline.py: nvidia-smi --query-gpu=memory.used, residual_target=...
rlix/pipeline/miles_coordinator.py: MILES_MAX_RESIDUAL_GPU_MEM_GB into per-pipeline runtime env
tests/test_env_utils.py

Tests
Unit test:
Result:
E2E Verification
Default smoke
Ran M11.2 dual-pipeline smoke on Vast using the current PR head:
Key result:
This confirms the PR head passes the default M11.2 smoke.
Env override smoke
Ran M11.2 dual-pipeline smoke on Vast with the coordinated MILES PR and an explicit override:
Key proof that the override reached per-pipeline MilesPipeline actors:
Smoke completion:
Notes:
- RolloutManager 500 / RemoteProtocolError traces appear during shutdown as residual generate requests are cancelled while engines tear down. This is known shutdown noise from prior M11.2 smoke runs and does not affect EXIT_CODE=0.
- 16.0 was discarded because it changed the timing and exposed a rollout async Collected: 0/8 hang. The PR keeps 10.0, which is the smoke-safe default verified above.

Scope
Configurability and signal-correctness only. This PR does not attempt model-size-derived threshold calculation; that would require a separate design using model size, runtime overhead, CUDA / NCCL residue, and topology information.
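The configurability half of this scope rests on fail-fast env parsing. A minimal sketch of how a helper like parse_env_positive_float might behave follows; the name and the (name, default) call shape come from the reviewed diff, while the body, the error messages, and the extra environ parameter are assumptions made for illustration only.

```python
import math
import os


def parse_env_positive_float(name, default, environ=None):
    """Read a positive finite float from the environment, or fail fast.

    Sketch only: `environ` is a hypothetical parameter added so the
    function can be tested without mutating os.environ.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        # Unset variable: fall back to the caller-supplied default.
        return default
    try:
        value = float(raw)
    except ValueError as exc:
        raise RuntimeError(f"{name}={raw!r} is not a number") from exc
    if not math.isfinite(value) or value <= 0:
        raise RuntimeError(f"{name}={raw!r} must be a positive finite number")
    return value


if __name__ == "__main__":
    name = "MILES_MAX_RESIDUAL_GPU_MEM_GB"
    print(parse_env_positive_float(name, 10.0, {}))
    print(parse_env_positive_float(name, 10.0, {name: "16"}))
```

Raising at startup rather than silently falling back to the default means a mistyped override cannot masquerade as a successful run with the wrong threshold.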
Refs: plans/m11-review.review-report/R02.md (R02-01, MEDIUM).