fix(scheduler): 2gpu 2ppl overlap fix #15

Merged
zhenyulincs merged 1 commit into rlops:zhenyu/miles-mvp-e2e from TianyeGGBond:tianye/fix-rollout-open-pipelines
May 24, 2026

@TianyeGGBond

Problem

pending_bucket_gen is a per-cycle snapshot: a pipeline's entry disappears as soon as its GENERATION pending request is consumed by a signal. Between the signal being sent and the arrival of the first ProgressReport in begin_progress_batch(), the planner cannot see any demand for this pipeline and skips it, which causes the 2gpu 2ppl task to fail.

Solution

Introduce rollout_open_pipelines: Dict[str, Optional[int]] in SchedulerState:

  • Set when request_gpus(priority=GENERATION) is enqueued (carrying step_target_estimate)

  • Cleared when the GENERATION cluster is released or clear_progress() exits all batch streams

This covers the entire lifecycle, from "pipeline initiating a rollout request" to "rollout completely ending," so the planner keeps seeing the pipeline's demand during the gap described above.
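The set/clear lifecycle above can be sketched as follows. This is a minimal illustration, not the code in state.py or scheduler.py: only SchedulerState, rollout_open_pipelines and its Dict[str, Optional[int]] type come from this PR. The helper names on_generation_request_enqueued and on_rollout_finished are hypothetical stand-ins for the points where request_gpus(priority=GENERATION) is enqueued and where the GENERATION cluster is released or clear_progress() runs.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SchedulerState:
    # pipeline_id -> step_target_estimate (None when no estimate was supplied)
    rollout_open_pipelines: Dict[str, Optional[int]] = field(default_factory=dict)


def on_generation_request_enqueued(
    state: SchedulerState,
    pipeline_id: str,
    step_target_estimate: Optional[int] = None,
) -> None:
    """Hypothetical hook: mark the rollout as open when the request is enqueued."""
    state.rollout_open_pipelines[pipeline_id] = step_target_estimate


def on_rollout_finished(state: SchedulerState, pipeline_id: str) -> None:
    """Hypothetical hook: clear the marker on release or clear_progress()."""
    # pop() with a default makes a repeated clear a harmless no-op.
    state.rollout_open_pipelines.pop(pipeline_id, None)
```

Because the entry is keyed by pipeline and not tied to the pending-request queue, consuming the pending request does not remove it.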

Changes

  • state.py: Added the rollout_open_pipelines field

  • scheduler.py: Added set/clear calls at the corresponding lifecycle points; removed the _clear_rollout_intent_locked() helper

  • planner.py: Replaced the pending_bucket_gen parameter with rollout_open_pipelines; removed has_pending_generation_request / get_pending_generation_step_target_estimate

  • tests: Updated all calls; incidentally fixed a deprecation warning for asyncio.get_event_loop().
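The PR does not say how the asyncio.get_event_loop() warning was fixed in the tests. One common approach, shown here as an assumption, is to stop fetching an implicit loop and let asyncio.run() create and close one per call. The coroutine fake_scheduler_call and the wrapper run_test_body are hypothetical placeholders for a real test body.

```python
import asyncio


async def fake_scheduler_call() -> str:
    """Hypothetical stand-in for an async scheduler call made by a test."""
    await asyncio.sleep(0)
    return "ok"


def run_test_body() -> str:
    # Previously (may warn on recent Python versions when no loop exists):
    #   asyncio.get_event_loop().run_until_complete(fake_scheduler_call())
    # asyncio.run() creates a fresh loop, runs the coroutine, then closes the loop.
    return asyncio.run(fake_scheduler_call())
```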

One RLHF rollout round — signal lifecycle

  request_gpus()     signal consumed     begin_progress_batch()    end_progress_batch()    release GPU
        │                   │                      │                        │                   │
        ▼                   ▼                      ▼                        ▼                   ▼
────────●───────────────────●──────────────────────●────────────────────────●───────────────────●────▶
        │                   │                      │                        │
        │         pending request                  │
        │         removed from bucket              │
        │                   │                      │
        │                   │◄─────── gap ────────►│
        │                   │ scheduler sees no    │
        │                   │ demand; pipeline     │
        │                   │ skipped, GPU stolen  │
        │                   │                      │
        │                                          │
        │  latest_progress_by_pipeline ────────────●────────────────────────● cleared by clear_progress()
        │  (populated on first ProgressReport)     │
        │                                          │
        ├── pending_bucket_gen ──────●             │
        │   (old signal, gone        │             │
        │    after signal consumed)  ▲             │
        │                            └─ too short  │
        │                                          │
        └── rollout_open_pipelines ────────────────────────────────────────●─── cleared
            (new signal, set at                                             │
             request_gpus() time)                                  on release / clear_progress()


BEFORE fix:  planner eligibility = pending_bucket_gen OR active_dp_workers OR active_allocations
             └─ gap window: all three are empty → pipeline skipped

AFTER fix:   planner eligibility = rollout_open_pipelines OR active_dp_workers OR active_allocations
             └─ gap window: rollout_open_pipelines still set → pipeline stays eligible
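The before/after eligibility rule can be sketched as a single predicate. This is an illustration only: the function name is_eligible is hypothetical, and modelling active_dp_workers, active_allocations and the old pending bucket as collections of pipeline ids is an assumption, since the PR does not show their real types.

```python
from typing import Collection


def is_eligible(
    pipeline_id: str,
    demand_signal: Collection[str],
    active_dp_workers: Collection[str],
    active_allocations: Collection[str],
) -> bool:
    """Return True if any of the three signals mentions the pipeline.

    demand_signal is pending_bucket_gen before the fix and
    rollout_open_pipelines after it. Membership is tested with `in`,
    not truthiness, because a rollout_open_pipelines value may be None.
    """
    return (
        pipeline_id in demand_signal
        or pipeline_id in active_dp_workers
        or pipeline_id in active_allocations
    )


# Gap window: the pending request was consumed, no ProgressReport yet.
before = is_eligible("p0", set(), set(), set())         # old signal is empty
after = is_eligible("p0", {"p0": None}, set(), set())   # new signal still set
```

In the gap window `before` is False and `after` is True, which matches the two cases in the diagram.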

…lout_open_pipelines

pending_bucket_gen is a per-cycle snapshot that disappears once the
GENERATION request is signalled.  Between signal and the first
ProgressReport from begin_progress_batch(), the planner had no demand
signal for the pipeline and could skip it entirely — stranding it
without GPU workers.

Introduce rollout_open_pipelines (pipeline_id -> step_target_estimate)
in SchedulerState as a durable replacement:
- Set when request_gpus(priority=GENERATION) is enqueued
- Cleared when the GENERATION cluster is released or clear_progress()
  retires all coordinator batch streams

planner.plan_generation_gap_ratio() now accepts rollout_open_pipelines
instead of pending_bucket_gen.  The two helper functions
has_pending_generation_request / get_pending_generation_step_target_estimate
are removed; their logic is inlined against the dict directly.

The bootstrap step_target path (step_target <= 0) and demand inflation
both key off rollout_open_pipelines, which covers the full rollout
lifecycle rather than just the brief window before the pending request
is consumed.
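As a rough sketch of how the two removed helpers reduce to dict operations, and of one way the bootstrap path could use them: the function resolve_step_target and its fallback behaviour are assumptions for illustration, not the logic in planner.py. Only the step_target <= 0 condition and the pipeline_id -> step_target_estimate mapping come from the commit message.

```python
from typing import Dict, Optional


def resolve_step_target(
    pipeline_id: str,
    reported_step_target: int,
    rollout_open_pipelines: Dict[str, Optional[int]],
) -> Optional[int]:
    """Hypothetical bootstrap fallback for a pipeline's step target."""
    if reported_step_target > 0:
        # Normal path: a usable target has already been reported.
        return reported_step_target
    # Bootstrap path (step_target <= 0): fall back to the estimate recorded
    # at request time. `pipeline_id in rollout_open_pipelines` plays the role
    # of has_pending_generation_request, and .get() the role of
    # get_pending_generation_step_target_estimate.
    return rollout_open_pipelines.get(pipeline_id)
```

For example, resolve_step_target("p0", 0, {"p0": 64}) falls back to the stored estimate, while a pipeline with no open rollout gets None.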

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zhenyulincs zhenyulincs merged commit dfd53f3 into rlops:zhenyu/miles-mvp-e2e May 24, 2026