fix(scheduler): 2gpu 2ppl overlap fix #15
Merged
zhenyulincs merged 1 commit on May 24, 2026
…lout_open_pipelines

pending_bucket_gen is a per-cycle snapshot that disappears once the GENERATION request is signalled. Between signal and the first ProgressReport from begin_progress_batch(), the planner had no demand signal for the pipeline and could skip it entirely, stranding it without GPU workers.

Introduce rollout_open_pipelines (pipeline_id -> step_target_estimate) in SchedulerState as a durable replacement:
- Set when request_gpus(priority=GENERATION) is enqueued
- Cleared when the GENERATION cluster is released or clear_progress() retires all coordinator batch streams

planner.plan_generation_gap_ratio() now accepts rollout_open_pipelines instead of pending_bucket_gen. The two helper functions has_pending_generation_request / get_pending_generation_step_target_estimate are removed; their logic is inlined against the dict directly.

The bootstrap step_target path (step_target <= 0) and demand inflation both key off rollout_open_pipelines, which covers the full rollout lifecycle rather than just the brief window before the pending request is consumed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Problem
pending_bucket_gen is a per-cycle snapshot that disappears as soon as a GENERATION pending request is consumed by a signal. Between the signal being sent and the arrival of the first ProgressReport in begin_progress_batch(), the planner cannot see any demand for this pipeline and will skip it, which causes the 2gpu 2ppl task to fail.
Solution
Introduce rollout_open_pipelines: Dict[str, Optional[int]] in SchedulerState:
- Set when request_gpus(priority=GENERATION) is enqueued (carrying step_target_estimate)
- Cleared when the GENERATION cluster is released or clear_progress() exits all batch streams
This covers the entire lifecycle from "pipeline initiating a rollout request" to "rollout completely ending".
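The set/clear lifecycle described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the trimmed-down SchedulerState and the helper names open_rollout / close_rollout are hypothetical, and only the rollout_open_pipelines field and its Dict[str, Optional[int]] type come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SchedulerState:
    """Hypothetical, trimmed-down stand-in for the real SchedulerState."""

    # pipeline_id -> step_target_estimate (None when no estimate is known yet)
    rollout_open_pipelines: Dict[str, Optional[int]] = field(default_factory=dict)


def open_rollout(state: SchedulerState, pipeline_id: str,
                 step_target_estimate: Optional[int] = None) -> None:
    """Assumed hook: called where a GENERATION request_gpus() is enqueued."""
    state.rollout_open_pipelines[pipeline_id] = step_target_estimate


def close_rollout(state: SchedulerState, pipeline_id: str) -> None:
    """Assumed hook: called on GENERATION release or when clear_progress()
    retires the last batch stream. pop() with a default keeps it idempotent."""
    state.rollout_open_pipelines.pop(pipeline_id, None)


state = SchedulerState()
open_rollout(state, "ppl-a", 128)   # rollout starts with a known estimate
open_rollout(state, "ppl-b")        # rollout starts without an estimate
close_rollout(state, "ppl-a")       # rollout ends
close_rollout(state, "ppl-a")       # a second clear is a harmless no-op
print(state.rollout_open_pipelines)
```

Making the clear path idempotent is a design choice of this sketch: both the release path and clear_progress() may try to clear the same pipeline, so neither has to know whether the other already ran.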
Changes
- state.py: Added the rollout_open_pipelines field
- scheduler.py: Added set/clear at the correct lifecycle points; removed the _clear_rollout_intent_locked() helper
- planner.py: Replaced the pending_bucket_gen parameter with rollout_open_pipelines; removed has_pending_generation_request / get_pending_generation_step_target_estimate
asyncio.get_event_loop().
One RLHF rollout round: signal lifecycle
request_gpus()    signal consumed    begin_progress_batch()   end_progress_batch()   release GPU
│                 │                  │                        │                      │
▼                 ▼                  ▼                        ▼                      ▼
●─────────────────●──────────────────●────────────────────────●──────────────────────●────▶

pending_bucket_gen (old signal; pending request removed from bucket once the signal is consumed, too short)
●─────────────────●

gap (scheduler sees no demand; pipeline skipped, GPU stolen)
                  ◄──────────────────►

latest_progress_by_pipeline (populated on first ProgressReport; cleared by clear_progress())
                                     ●────────────────────────●

rollout_open_pipelines (new signal; set at request_gpus() time, cleared on release / clear_progress())
●────────────────────────────────────────────────────────────────────────────────────●

BEFORE fix: planner eligibility = pending_bucket_gen OR active_dp_workers OR active_allocations
            └─ gap window: all three are empty → pipeline skipped
AFTER fix:  planner eligibility = rollout_open_pipelines OR active_dp_workers OR active_allocations
            └─ gap window: rollout_open_pipelines still set → pipeline stays eligible
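The BEFORE/AFTER eligibility rule can be expressed as a small predicate. The sketch below is illustrative only: the function names and the container types (a set for pending_bucket_gen and active_allocations, a dict of worker counts for active_dp_workers) are assumptions, not the planner's actual signatures.

```python
from typing import Dict, Optional, Set


def eligible_before_fix(pipeline_id: str,
                        pending_bucket_gen: Set[str],
                        active_dp_workers: Dict[str, int],
                        active_allocations: Set[str]) -> bool:
    """Old rule: demand is visible only while the pending request exists."""
    return (pipeline_id in pending_bucket_gen
            or active_dp_workers.get(pipeline_id, 0) > 0
            or pipeline_id in active_allocations)


def eligible_after_fix(pipeline_id: str,
                       rollout_open_pipelines: Dict[str, Optional[int]],
                       active_dp_workers: Dict[str, int],
                       active_allocations: Set[str]) -> bool:
    """New rule: demand is visible for the whole rollout lifecycle.

    Membership is tested with `in`, not truthiness, because the stored
    step_target_estimate may legitimately be None.
    """
    return (pipeline_id in rollout_open_pipelines
            or active_dp_workers.get(pipeline_id, 0) > 0
            or pipeline_id in active_allocations)


# Gap window: the signal was consumed, no ProgressReport has arrived yet,
# and the pipeline holds no workers or allocations.
gap_before = eligible_before_fix("ppl-a", set(), {}, set())
gap_after = eligible_after_fix("ppl-a", {"ppl-a": None}, {}, set())
print(gap_before, gap_after)
```

In the gap window the old predicate has nothing left to match on, while the new one still finds the pipeline's key in rollout_open_pipelines even when no step target estimate was recorded.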