fix(scheduler): 2gpu 2ppl overlap fix by TianyeGGBond · Pull Request #15 · rlops/rlix

TianyeGGBond · 2026-05-20T02:31:03Z

Problem

pending_bucket_gen is a per-cycle snapshot: a GENERATION pending request disappears from it as soon as a signal consumes that request. Between the signal being sent and the first ProgressReport arriving in begin_progress_batch(), the planner sees no demand for the pipeline and skips it, which causes the 2gpu 2ppl task to fail.

Solution

Introduce rollout_open_pipelines: Dict[str, Optional[int]] in SchedulerState:

Set when request_gpus(priority=GENERATION) is enqueued (carrying step_target_estimate)
Cleared when the GENERATION cluster is released or clear_progress() exits all batch streams

This covers the entire lifecycle, from the pipeline initiating a rollout request to the rollout ending completely.
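The set/clear lifecycle described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: only SchedulerState and the rollout_open_pipelines field type come from the description, while the helper names on_generation_request_enqueued and on_rollout_closed are hypothetical stand-ins for the places in scheduler.py where the set and clear happen.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SchedulerState:
    # pipeline_id -> step_target_estimate (None when no estimate was supplied).
    rollout_open_pipelines: Dict[str, Optional[int]] = field(default_factory=dict)


def on_generation_request_enqueued(
    state: SchedulerState, pipeline_id: str, step_target_estimate: Optional[int] = None
) -> None:
    """Hypothetical hook: mark the rollout as open when the GENERATION request is enqueued."""
    state.rollout_open_pipelines[pipeline_id] = step_target_estimate


def on_rollout_closed(state: SchedulerState, pipeline_id: str) -> None:
    """Hypothetical hook: clear the marker on cluster release or when clear_progress() ends all streams."""
    # pop() with a default keeps the clear idempotent if both paths fire.
    state.rollout_open_pipelines.pop(pipeline_id, None)


state = SchedulerState()
on_generation_request_enqueued(state, "pipeline-a", 128)
print(state.rollout_open_pipelines)  # {'pipeline-a': 128}
on_rollout_closed(state, "pipeline-a")
print(state.rollout_open_pipelines)  # {}
```

Because the value may be None, callers should test membership (`pipeline_id in state.rollout_open_pipelines`) rather than truthiness of the stored estimate.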

Changes

state.py: Added the rollout_open_pipelines field
scheduler.py: Added set/clear to the correct lifecycle nodes; removed the _clear_rollout_intent_locked() helper
planner.py: Replaced the pending_bucket_gen parameter with rollout_open_pipelines; removed has_pending_generation_request / get_pending_generation_step_target_estimate
tests: Updated all calls; incidentally fixed a deprecation warning for asyncio.get_event_loop().
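The description does not say how the deprecation warning was fixed. One common approach, assuming the tests called asyncio.get_event_loop() where no loop was running, is to let asyncio.run() own the loop and to use asyncio.get_running_loop() inside coroutines. The coroutine name scheduler_cycle below is hypothetical.

```python
import asyncio


async def scheduler_cycle() -> str:
    # Inside a coroutine a loop is guaranteed to be running, so
    # get_running_loop() is the non-deprecated way to reach it.
    loop = asyncio.get_running_loop()
    done = loop.create_future()
    loop.call_soon(done.set_result, "done")
    return await done


# asyncio.run() creates, runs, and closes the event loop, replacing
# the get_event_loop().run_until_complete(...) pattern.
result = asyncio.run(scheduler_cycle())
print(result)  # done
```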

One RLHF rollout round — signal lifecycle

  request_gpus()     signal consumed     begin_progress_batch()    end_progress_batch()    release GPU
        │                   │                      │                        │                   │
        ▼                   ▼                      ▼                        ▼                   ▼
────────●───────────────────●──────────────────────●────────────────────────●───────────────────●────▶
        │                   │                      │                        │
        │         pending request                  │
        │         removed from bucket              │
        │                   │                      │
        │◄───── gap ────────►                      │
        │   scheduler sees no demand               │
        │   pipeline skipped, GPU stolen           │
        │                   │                      │
        │                                          │
        │  latest_progress_by_pipeline ────────────●────────────────────────● cleared by clear_progress()
        │  (populated on first ProgressReport)     │
        │                                          │
        ├── pending_bucket_gen ──────●             │
        │   (old signal, gone        │             │
        │    after signal consumed)  ▲             │
        │                            └─ too short  │
        │                                          │
        └── rollout_open_pipelines ────────────────────────────────────────●─── cleared
            (new signal, set at                                             │
             request_gpus() time)                                  on release / clear_progress()


BEFORE fix:  planner eligibility = pending_bucket_gen OR active_dp_workers OR active_allocations
             └─ gap window: all three are empty → pipeline skipped

AFTER fix:   planner eligibility = rollout_open_pipelines OR active_dp_workers OR active_allocations
             └─ gap window: rollout_open_pipelines still set → pipeline stays eligible
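The before/after eligibility rule can be illustrated with a small sketch. It assumes each signal is a collection keyed by pipeline id; the actual container types in planner.py are not shown in this PR description, and the function name is_eligible is hypothetical.

```python
from typing import Collection, Dict, Optional


def is_eligible(
    pipeline_id: str,
    demand_signal: Collection[str],
    active_dp_workers: Collection[str],
    active_allocations: Collection[str],
) -> bool:
    """A pipeline is planned if any of the three signals mentions it."""
    return (
        pipeline_id in demand_signal
        or pipeline_id in active_dp_workers
        or pipeline_id in active_allocations
    )


# Gap window: the signal has consumed the pending request, but no
# ProgressReport has arrived and no workers or allocations exist yet.
pending_bucket_gen: set = set()  # old signal, already emptied
rollout_open_pipelines: Dict[str, Optional[int]] = {"pipeline-a": None}  # new signal
active_dp_workers: set = set()
active_allocations: set = set()

before = is_eligible("pipeline-a", pending_bucket_gen, active_dp_workers, active_allocations)
after = is_eligible("pipeline-a", rollout_open_pipelines, active_dp_workers, active_allocations)
print(before, after)  # False True
```

The membership test matters here: a pipeline whose step_target_estimate is None is still open, so checking the stored value for truthiness would reintroduce the gap.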

…lout_open_pipelines

pending_bucket_gen is a per-cycle snapshot that disappears once the GENERATION request is signalled. Between signal and the first ProgressReport from begin_progress_batch(), the planner had no demand signal for the pipeline and could skip it entirely, stranding it without GPU workers.

Introduce rollout_open_pipelines (pipeline_id -> step_target_estimate) in SchedulerState as a durable replacement:
- Set when request_gpus(priority=GENERATION) is enqueued
- Cleared when the GENERATION cluster is released or clear_progress() retires all coordinator batch streams

planner.plan_generation_gap_ratio() now accepts rollout_open_pipelines instead of pending_bucket_gen. The two helper functions has_pending_generation_request / get_pending_generation_step_target_estimate are removed; their logic is inlined against the dict directly.

The bootstrap step_target path (step_target <= 0) and demand inflation both key off rollout_open_pipelines, which covers the full rollout lifecycle rather than just the brief window before the pending request is consumed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

TianyeGGBond requested review from howard989, taoluo and zhenyulincs May 20, 2026 02:31

zhenyulincs merged commit dfd53f3 into rlops:zhenyu/miles-mvp-e2e May 24, 2026

zhenyulincs merged 1 commit into rlops:zhenyu/miles-mvp-e2e from TianyeGGBond:tianye/fix-rollout-open-pipelines
