Tianye/4gpu 2ppl fix on open pipelines by TianyeGGBond · Pull Request #16 · rlops/rlix

TianyeGGBond · 2026-05-22T19:10:02Z

Between rollouts, actor_train preempts shared infer GPUs, causing both pipelines' actor_infer DP workers to shrink to set().

The gap-ratio planner needs a fresh demand signal to re-wake engines, but that signal only arrived from begin_progress_batch inside the rollout function — after sample dispatch had already started.

Race result: whichever pipeline fires begin_progress_batch first wins all DP workers; the other pipeline hangs.

Two sub-bugs compounded it:

pending_bucket_gen is a per-cycle snapshot that vanishes once consumed, so the planner sees no demand during the rollout boundary gap.
MilesRLixHooks was never wired at init, so every begin_progress_batch hit NoOpRLixHooks; the scheduler never received rollout demand at all.

sequenceDiagram
    participant P1 as Pipeline 1
    participant P2 as Pipeline 2
    participant Sched as Scheduler (gap-ratio)

    Note over P1,Sched: _after_training: actor_train preempts infer GPUs
    Note over Sched: active_dp_ranks = ∅ for both<br/>pending_bucket_gen consumed → planner skips both

    P1->>Sched: begin_progress_batch()  ← first to fire
    Sched->>Sched: wakes P1 engines, allocates all DP workers ✓

    P2->>Sched: begin_progress_batch()  ← too late
    Sched-->>P2: ❌ P1 holds all DP workers
    Note over P2: "No progress for 30.0s. Queue: 0, Collected: 0/N"

Fix

Pre-signal scheduler demand before the rollout function starts, using request_gpus instead of report_progress.

This makes the call write to the new durable rollout_open_pipelines registry and block until at least one DP worker is awake.

sequenceDiagram
    participant P1 as Pipeline 1
    participant P2 as Pipeline 2
    participant Sched as Scheduler

    Note over P1,Sched: _after_training: actor_train preempts infer GPUs
    Note over Sched: active_dp_ranks = ∅ for both

    P1->>Sched: signal_rollout_demand() → request_gpus(GENERATION, step_target=N)
    P2->>Sched: signal_rollout_demand() → request_gpus(GENERATION, step_target=N)
    Note over Sched: rollout_open_pipelines = {P1: N, P2: N} ← durable<br/>gap-ratio sees both → allocates DP workers for both

    Sched-->>P1: unblocks ✓ (engines awake)
    Sched-->>P2: unblocks ✓ (engines awake)

    P1->>P1: rollout fn dispatches first sample ✓
    P2->>P2: rollout fn dispatches first sample ✓

Changes

Commit	What
`dfd53f3`	Replace `pending_bucket_gen` with durable `rollout_open_pipelines` (`pipeline_id → step_target`) in `SchedulerState`; planner reads it for the full rollout lifecycle.
`f2ac106`	Wire `MilesRLixHooks` in `_init_phase_b_infer`; add no-restart tombstones to `Orchestrator` / `SchedulerActor`.
`173dbcd`	Add `MilesPipeline.signal_rollout_demand()` to pre-signal demand before rollout dispatch.
`8436e95`	Fix `signal_rollout_demand` to call `request_gpus` instead of `report_progress`.
`e6276b9`	Add tests: static guard for no runtime `Orchestrator` dependency in scheduler, plus chaos test where orchestrator kill leaves scheduler reachable.

Requires companion `miles` PR

miles@tianye/4gpu-2ppl-fix

This companion PR wires:

rlix_train_loop.signal_demand hook
run_miles_dual call-site

Test

Unit tests

test_orchestrator_death_is_benign.py
test_gap_ratio.py
test_scheduler_apply_plan_invariants.py

E2E test

run_4ppl_full_overlap_n10.sh --num-rollout 5

Configuration:

P1 train = [0, 1]
P1 infer = [0, 1, 2, 3]
P2 train = [2, 3]
P2 infer = [0, 1, 2, 3]

Result:

Run 28: all 5 rollouts clean, EXIT=0 ✓

…lout_open_pipelines pending_bucket_gen is a per-cycle snapshot that disappears once the GENERATION request is signalled. Between signal and the first ProgressReport from begin_progress_batch(), the planner had no demand signal for the pipeline and could skip it entirely — stranding it without GPU workers. Introduce rollout_open_pipelines (pipeline_id -> step_target_estimate) in SchedulerState as a durable replacement: - Set when request_gpus(priority=GENERATION) is enqueued - Cleared when the GENERATION cluster is released or clear_progress() retires all coordinator batch streams planner.plan_generation_gap_ratio() now accepts rollout_open_pipelines instead of pending_bucket_gen. The two helper functions has_pending_generation_request / get_pending_generation_step_target_estimate are removed; their logic is inlined against the dict directly. The bootstrap step_target path (step_target <= 0) and demand inflation both key off rollout_open_pipelines, which covers the full rollout lifecycle rather than just the brief window before the pending request is consumed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tor no-restart tombstones On top of dfd53f3's rollout_open_pipelines fix, this commit lands the two pieces of the prior 4-GPU 2-pipeline iteration that are NOT superseded by the new durable-open-registry scheme: 1) Phase B step 4c — MilesPipeline now calls rollout_manager.set_rlix_hooks.remote(MilesRLixHooks( coordinator_handle=self._coordinator_handle, pipeline_id=self._pipeline_id, )) inside _init_phase_b_infer (line ~372). Without this, every begin_progress_batch / bump_completed from the rollout function lands on NoOpRLixHooks and the scheduler never sees rollout demand — gap-ratio then has no signal to wake engines for rollout N+1 after _after_training released actor_train. Requires the matching set_rlix_hooks plumbing in TianyeGGBond/miles@tianye/4gpu-2ppl-fix. 2) MilesRLixHooks observability — begin_progress_batch now emits one INFO line, _fire_report adds a DEBUG diagnostic, and the missing-coordinator-method branch is upgraded from DEBUG to WARNING ("progress signal lost"). Lets future hangs be traced without a debugger. Plus tombstone comments on Orchestrator/SchedulerActor max_restarts=0, max_task_retries=0 documenting WHY restart is unsafe (in-memory state on _topology_ready / active_allocations / pipeline_registry would be wiped on a Ray-driven restart, leaving the actor stuck waiting on a re-set event). Documentation-only; the values were already 0/0. The obsolete pieces from prior iteration commit 1c9755a (the step_target_estimate-on-ClusterAllocation Layer-1 fallback, gen_step_target_estimates hoisting, ClusterAllocation field bump, and the 354-line tests/test_gap_ratio.py addition) are dropped here because dfd53f3 makes them redundant — rollout_open_pipelines already gives the planner durable demand-window visibility without needing a fallback to the persisted estimate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pre-signal scheduler demand BEFORE each rollout dispatch so the gap-ratio planner can wake actor_infer engines without waiting for the rollout function's first begin_progress_batch — too late under cross-pipeline contention, where the first pipeline to fire begin_progress_batch wins the GENERATION budget and the peer pipeline starves on "Warning: No progress for 30.0s. Queue size: 0, Collected: 0/N". Root cause: between rollouts, both pipelines' actor_infer DP workers shrink to set() because actor_train (ACTOR_TRAINING priority) preempts the shared infer GPUs. The existing gap-ratio fallback (alloc.step_target_estimate + v3 peer-signal rule) keeps both pipelines visible to the planner but does NOT proactively expand engines until a fresh demand signal lands — and that signal previously arrived inside the rollout function (via MilesRLixHooks.begin_progress_batch), creating a chicken-and-egg where the function needs engines awake to dispatch its first sample but the scheduler needs the first sample to know the function needs engines. Mirrors rlix/pipeline/full_finetune_pipeline.py:696 (_notify_release_then_request_cluster_gpus) which re-requests actor_infer at GENERATION with a fresh step_target_estimate per rollout. miles cannot release+re-request actor_infer between rollouts without tearing down SGLang engines, so this fix instead publishes a synthetic new_batch=True progress report shaped exactly like MilesCoordinator._aggregate_and_emit output — the scheduler "latest wins" overwrites it cleanly when the real begin_progress_batch lands. Changes: - MilesPipeline.signal_rollout_demand(rollout_id, step_target): posts ProgressReport(mode="aggregated", new_batch=True, completed=0, step_target_trajectories=step_target) to scheduler.report_progress. Best-effort: try/except logs warning and returns so a transient RPC failure cannot block the rollout (in-rollout begin_progress_batch remains the correctness baseline). - Orchestrator + Scheduler max_concurrency=4 (was default 1) — mitigates the Ray-2.55.1 worker.py:1039 "'Worker' object has no attribute core_worker" race that fires more readily under the added per-rollout signal_demand traffic. Hot-path per-rollout RPC bypasses the orchestrator (signal_demand goes pipeline -> scheduler directly), and the scheduler's state mutations remain under self._lock = asyncio.Lock, so the bump is data-hazard-free. Test: run 28 of /root/run_4ppl_full_overlap_n10.sh (--num-rollout 5, P1 train=[0,1] infer=[0,1,2,3], P2 train=[2,3] infer=[0,1,2,3]) — completed all 5 rollouts cleanly with EXIT=0; prior runs (run 25 baseline, run 27 with concurrency-bump only) hung on rollout 2 collection with "No progress for 30.0s". Companion miles changes (rlix_train_loop.signal_demand hook + run_miles_dual driver wire-up) ship on miles@tianye/4gpu-2ppl-fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

On the dfd53f3 base, scheduler.report_progress does NOT write to SchedulerState.rollout_open_pipelines — that registry is only populated by request_gpus (line 638) and cleared by clear_progress / notify_release_gpus. The planner's _receiver_eligible and the remaining/step_target derivation read FROM this registry. The previous signal_rollout_demand posted a synthetic ProgressReport which was a no-op against this contract: the report landed in latest_progress_by_pipeline but never marked the pipeline as "open" in rollout_open_pipelines. After the prior rollout's clear_progress had popped the registry entry, the planner skipped the pipeline entirely (continue) and engines never woke. Empirical repro: run 29 (smoke n=3 on rebased branch) hung on rollout 2 with "No progress for 30.0s. Queue size: 0, Collected: 0/4" for both pipelines simultaneously. The fix swaps to scheduler.request_gpus(actor_infer, GENERATION, step_target_estimate=N). request_gpus's existing-allocation path at scheduler.py:614-625 gives us the right semantics for free: - alloc.active_dp_ranks non-empty (mid-rollout) — short-circuits and returns the existing GPU list. Cheap no-op. - alloc.active_dp_ranks == set() (rollout boundary) — enqueues a pending request, writes rollout_open_pipelines[pipeline_id] = step_target_estimate, fires the wakeup, BLOCKS until the planner grants at least one DP worker. The blocking semantics on the boundary case give us the synchronization the synthetic ProgressReport could never provide: when signal_rollout_demand returns, the rollout function is guaranteed at least one awake engine for its first sample dispatch. Also removes the now-unused ProgressReport import and updates docs/internal/4gpu-2ppl-rollout2-hang-fix.md to describe the request_gpus contract instead of the synthetic-ProgressReport contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… test) Adds two-tier coverage for the Ray-2.55.1 worker.py:1039 race that max_concurrency=4 only mitigates: 1) test_no_runtime_orchestrator_dependency_in_scheduler — static guard. Asserts SchedulerImpl's runtime surface (request_gpus, report_progress, notify_release_gpus, clear_progress, notify_release_then_request_gpus) does NOT touch the Orchestrator at runtime. If a future refactor re-introduces a runtime dependency, this fires in CI before it can hit a real training run. Always runs (no Ray cluster required). 2) test_scheduler_reachable_after_orchestrator_kill — chaos test. Brings up the rlix control plane via client.connect, kills the Orchestrator with ray.kill, then verifies the SchedulerActor handle stays resolvable via ray.get_actor. Skipped when Ray is unavailable. Updates docs/internal/4gpu-2ppl-rollout2-hang-fix.md with explicit mitigation-vs-fix analysis: the race itself is upstream Ray; we cannot patch worker.py:1039 cleanly. But it is architecturally harmless because (a) all orchestrator callers are driver-startup or fatal-error-only, (b) the fatal-error path already swallows ActorDiedError, and (c) the static guard test prevents regressions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

TianyeGGBond's paired PRs (#16 + rlops/miles#4) ship the proper fix for B-13 (mp1 wedge after mp2 completes early under cross-overlap). The fix uses the scheduler's existing gap-ratio machinery via pre-signalled durable demand instead of fighting asyncio.gather, and is materially better than my proposed M11.3 fire-shutdown_hard-early sketch. Root cause (refined from TianyeGGBond's PR description, Codex review of the rlix planner code confirms): - Between rollouts, actor_train preempts shared infer GPUs → both pipelines' actor_infer DP workers shrink to set(). - begin_progress_batch fires INSIDE the rollout fn, after dispatch starts → first pipeline wins all DP workers, peer hangs. - Two compounding sub-bugs: (a) pending_bucket_gen is a per-cycle snapshot that vanishes once consumed (planner blind during rollout boundary); (b) MilesRLixHooks was never wired at init → begin_progress_batch hit NoOpRLixHooks (silent no-op). Joint fix: - rlix#16: durable rollout_open_pipelines registry in scheduler; MilesPipeline.signal_rollout_demand(rollout_id, step_target); wires MilesRLixHooks in Phase B init; pending_bucket_gen durability. - miles#4: _signal_demand step hook in dual driver; set_rlix_hooks + _rlix_hooks plumbing on RolloutManager; call_rollout_fn forwards rlix_hooks; max_concurrency=4 on coordinator; router 499 + counter balance fix. Codex joint verdict: APPROVE_WITH_NOTES, 0 BLOCKERS. Prior CRITICAL (signal_rollout_demand missing) + HIGH (set_rlix_hooks not wired) both RESOLVED in the pair. Joint smoke evidence (per miles#4 PR): run_4ppl_full_overlap_n10.sh --num-rollout 5, run 28 EXIT=0 — earlier baseline hung on rollout 2. Merge order required: rlix#16 MUST land before miles#4 (miles#4 alone raises AttributeError on pipe.signal_rollout_demand.remote without the rlix#16 method). M11.3 follow-up scope shrinks: concurrent-resize stress test under max_concurrency=4; automated 4-GPU 2-pipeline rollout-2+ regression; verify generate_rollout_fully_async accepts rlix_hooks kw.

Both PRs merged into the zhenyu/* branches; my Phase 1/3 work fast-forwarded cleanly under both merges (no conflicts because TianyeGGBond branched off my prior commits). Verified merged state contains both: - My Phase 3c finish_init_offload + Phase 1 release_train_only on rlix - TianyeGGBond's signal_rollout_demand + set_rlix_hooks wiring on rlix - My Phase 2 overlap env-driven topology on miles - TianyeGGBond's _signal_demand step hook + max_concurrency=4 on miles Vast 4×A40 dual-overlap regression smoke pending on a new instance — to be run with the joint fix to confirm all 7 PASS-bar conditions (including C2 shutdown_hard complete, previously blocked by B-13).

New: docs/m11-tianye-prs-review.md — standalone document summarizing TianyeGGBond's paired PRs (#16 + rlops/miles#4) that fixed B-13 (M11.2 overlap rollout-2+ hang). Covers: - Root cause analysis (per-cycle pending_bucket_gen + MilesRLixHooks never wired) - Fix architecture (durable rollout_open_pipelines registry + signal_rollout_demand pre-dispatch) - File-level change inventory across both PRs (~14 files / +300 LoC) - Validation layers: 5 Codex review rounds, local merge spot-check, real vast smoke v8/v9/v10 evidence - Outstanding follow-ups (max_concurrency=4 stress test, rlix_hooks kw verification) - Doc + commit pointers Easier to share than the full m11-2-overlap-log.md for reviewers who only need the Tianye-PR context.

Adds §3.5 "Post-Option-A milestones" tracking 6 new lines of work: - Phase 1 — R04-F1 try/finally + release_train_only shim (rlix 47bf02b) - Phase 3 — Option β state machine (finish_init_offload + Codex Q5 invariants) (rlix 47bf02b) - Phase 7 — F3+F4 driver-side cleanup (asyncio.wait FIRST_EXCEPTION + try/finally shutdown_hard) (miles 79f2874) - Tianye's paired PRs (#16 + rlops/miles#4) fix B-13 wedge — durable rollout_open_pipelines registry + signal_rollout_demand + MilesRLixHooks wiring; merged 2026-05-24 - B-14 hardware compat — MILES_SKIP_TMS_PAUSE=1 was unconditionally set; removing it on stable-tms hardware enables tms.pause() (rlix a011bbf) - Batch B-15+F5+F9+F7 — harness C0 + nvidia-smi INFO + odd-GPU validation + pause_generation docs (rlix a4c6369 + miles 1487c3f) Updates §1.4 status table — M11.2 now has 3 rows: - Option A (disjoint pools, M11.2 Attempt 4 GREEN) - Real overlap (shared infer pool, verified 2026-05-24 v8/v9/v10) - M11.3+ deferred Updates §6 known limitations — marks F1/F3/F4/B-13/B-14/B-15/F5/F7/F9 as RESOLVED with commit pointers. Adds rows for MED1 (concurrent resize stress) + LOW1 (rlix_hooks kw) follow-ups. F2 noted as howard989's PR in flight. Updates Appendix C — current HEADs (rlix 9b1fa90, miles 1487c3f). Cross-refs to plans/m11-2-overlap-log.md + docs/m11-tianye-prs-review.md.

TianyeGGBond and others added 5 commits May 19, 2026 22:19

TianyeGGBond requested review from howard989, taoluo and zhenyulincs May 22, 2026 19:10

zhenyulincs merged commit 80583ee into rlops:zhenyu/miles-mvp-e2e May 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tianye/4gpu 2ppl fix on open pipelines#16

Tianye/4gpu 2ppl fix on open pipelines#16
zhenyulincs merged 5 commits into
rlops:zhenyu/miles-mvp-e2efrom
TianyeGGBond:tianye/4gpu-2ppl-fix-on-open-pipelines

TianyeGGBond commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

TianyeGGBond commented May 22, 2026

Fix

Changes

Requires companion miles PR

Test

Unit tests

E2E test

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Requires companion `miles` PR