Tianye/4gpu 2ppl fix on open pipelines#16
Merged
zhenyulincs merged 5 commits intoMay 24, 2026
Merged
Conversation
…lout_open_pipelines pending_bucket_gen is a per-cycle snapshot that disappears once the GENERATION request is signalled. Between signal and the first ProgressReport from begin_progress_batch(), the planner had no demand signal for the pipeline and could skip it entirely — stranding it without GPU workers. Introduce rollout_open_pipelines (pipeline_id -> step_target_estimate) in SchedulerState as a durable replacement: - Set when request_gpus(priority=GENERATION) is enqueued - Cleared when the GENERATION cluster is released or clear_progress() retires all coordinator batch streams planner.plan_generation_gap_ratio() now accepts rollout_open_pipelines instead of pending_bucket_gen. The two helper functions has_pending_generation_request / get_pending_generation_step_target_estimate are removed; their logic is inlined against the dict directly. The bootstrap step_target path (step_target <= 0) and demand inflation both key off rollout_open_pipelines, which covers the full rollout lifecycle rather than just the brief window before the pending request is consumed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tor no-restart tombstones On top of dfd53f3's rollout_open_pipelines fix, this commit lands the two pieces of the prior 4-GPU 2-pipeline iteration that are NOT superseded by the new durable-open-registry scheme: 1) Phase B step 4c — MilesPipeline now calls rollout_manager.set_rlix_hooks.remote(MilesRLixHooks( coordinator_handle=self._coordinator_handle, pipeline_id=self._pipeline_id, )) inside _init_phase_b_infer (line ~372). Without this, every begin_progress_batch / bump_completed from the rollout function lands on NoOpRLixHooks and the scheduler never sees rollout demand — gap-ratio then has no signal to wake engines for rollout N+1 after _after_training released actor_train. Requires the matching set_rlix_hooks plumbing in TianyeGGBond/miles@tianye/4gpu-2ppl-fix. 2) MilesRLixHooks observability — begin_progress_batch now emits one INFO line, _fire_report adds a DEBUG diagnostic, and the missing-coordinator-method branch is upgraded from DEBUG to WARNING ("progress signal lost"). Lets future hangs be traced without a debugger. Plus tombstone comments on Orchestrator/SchedulerActor max_restarts=0, max_task_retries=0 documenting WHY restart is unsafe (in-memory state on _topology_ready / active_allocations / pipeline_registry would be wiped on a Ray-driven restart, leaving the actor stuck waiting on a re-set event). Documentation-only; the values were already 0/0. The obsolete pieces from prior iteration commit 1c9755a (the step_target_estimate-on-ClusterAllocation Layer-1 fallback, gen_step_target_estimates hoisting, ClusterAllocation field bump, and the 354-line tests/test_gap_ratio.py addition) are dropped here because dfd53f3 makes them redundant — rollout_open_pipelines already gives the planner durable demand-window visibility without needing a fallback to the persisted estimate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-signal scheduler demand BEFORE each rollout dispatch so the gap-ratio planner can wake actor_infer engines without waiting for the rollout function's first begin_progress_batch — too late under cross-pipeline contention, where the first pipeline to fire begin_progress_batch wins the GENERATION budget and the peer pipeline starves on "Warning: No progress for 30.0s. Queue size: 0, Collected: 0/N". Root cause: between rollouts, both pipelines' actor_infer DP workers shrink to set() because actor_train (ACTOR_TRAINING priority) preempts the shared infer GPUs. The existing gap-ratio fallback (alloc.step_target_estimate + v3 peer-signal rule) keeps both pipelines visible to the planner but does NOT proactively expand engines until a fresh demand signal lands — and that signal previously arrived inside the rollout function (via MilesRLixHooks.begin_progress_batch), creating a chicken-and-egg where the function needs engines awake to dispatch its first sample but the scheduler needs the first sample to know the function needs engines. Mirrors rlix/pipeline/full_finetune_pipeline.py:696 (_notify_release_then_request_cluster_gpus) which re-requests actor_infer at GENERATION with a fresh step_target_estimate per rollout. miles cannot release+re-request actor_infer between rollouts without tearing down SGLang engines, so this fix instead publishes a synthetic new_batch=True progress report shaped exactly like MilesCoordinator._aggregate_and_emit output — the scheduler "latest wins" overwrites it cleanly when the real begin_progress_batch lands. Changes: - MilesPipeline.signal_rollout_demand(rollout_id, step_target): posts ProgressReport(mode="aggregated", new_batch=True, completed=0, step_target_trajectories=step_target) to scheduler.report_progress. Best-effort: try/except logs warning and returns so a transient RPC failure cannot block the rollout (in-rollout begin_progress_batch remains the correctness baseline). - Orchestrator + Scheduler max_concurrency=4 (was default 1) — mitigates the Ray-2.55.1 worker.py:1039 "'Worker' object has no attribute core_worker" race that fires more readily under the added per-rollout signal_demand traffic. Hot-path per-rollout RPC bypasses the orchestrator (signal_demand goes pipeline -> scheduler directly), and the scheduler's state mutations remain under self._lock = asyncio.Lock, so the bump is data-hazard-free. Test: run 28 of /root/run_4ppl_full_overlap_n10.sh (--num-rollout 5, P1 train=[0,1] infer=[0,1,2,3], P2 train=[2,3] infer=[0,1,2,3]) — completed all 5 rollouts cleanly with EXIT=0; prior runs (run 25 baseline, run 27 with concurrency-bump only) hung on rollout 2 collection with "No progress for 30.0s". Companion miles changes (rlix_train_loop.signal_demand hook + run_miles_dual driver wire-up) ship on miles@tianye/4gpu-2ppl-fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On the dfd53f3 base, scheduler.report_progress does NOT write to SchedulerState.rollout_open_pipelines — that registry is only populated by request_gpus (line 638) and cleared by clear_progress / notify_release_gpus. The planner's _receiver_eligible and the remaining/step_target derivation read FROM this registry. The previous signal_rollout_demand posted a synthetic ProgressReport which was a no-op against this contract: the report landed in latest_progress_by_pipeline but never marked the pipeline as "open" in rollout_open_pipelines. After the prior rollout's clear_progress had popped the registry entry, the planner skipped the pipeline entirely (continue) and engines never woke. Empirical repro: run 29 (smoke n=3 on rebased branch) hung on rollout 2 with "No progress for 30.0s. Queue size: 0, Collected: 0/4" for both pipelines simultaneously. The fix swaps to scheduler.request_gpus(actor_infer, GENERATION, step_target_estimate=N). request_gpus's existing-allocation path at scheduler.py:614-625 gives us the right semantics for free: - alloc.active_dp_ranks non-empty (mid-rollout) — short-circuits and returns the existing GPU list. Cheap no-op. - alloc.active_dp_ranks == set() (rollout boundary) — enqueues a pending request, writes rollout_open_pipelines[pipeline_id] = step_target_estimate, fires the wakeup, BLOCKS until the planner grants at least one DP worker. The blocking semantics on the boundary case give us the synchronization the synthetic ProgressReport could never provide: when signal_rollout_demand returns, the rollout function is guaranteed at least one awake engine for its first sample dispatch. Also removes the now-unused ProgressReport import and updates docs/internal/4gpu-2ppl-rollout2-hang-fix.md to describe the request_gpus contract instead of the synthetic-ProgressReport contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… test) Adds two-tier coverage for the Ray-2.55.1 worker.py:1039 race that max_concurrency=4 only mitigates: 1) test_no_runtime_orchestrator_dependency_in_scheduler — static guard. Asserts SchedulerImpl's runtime surface (request_gpus, report_progress, notify_release_gpus, clear_progress, notify_release_then_request_gpus) does NOT touch the Orchestrator at runtime. If a future refactor re-introduces a runtime dependency, this fires in CI before it can hit a real training run. Always runs (no Ray cluster required). 2) test_scheduler_reachable_after_orchestrator_kill — chaos test. Brings up the rlix control plane via client.connect, kills the Orchestrator with ray.kill, then verifies the SchedulerActor handle stays resolvable via ray.get_actor. Skipped when Ray is unavailable. Updates docs/internal/4gpu-2ppl-rollout2-hang-fix.md with explicit mitigation-vs-fix analysis: the race itself is upstream Ray; we cannot patch worker.py:1039 cleanly. But it is architecturally harmless because (a) all orchestrator callers are driver-startup or fatal-error-only, (b) the fatal-error path already swallows ActorDiedError, and (c) the static guard test prevents regressions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
zhenyulincs
added a commit
that referenced
this pull request
May 24, 2026
TianyeGGBond's paired PRs (#16 + rlops/miles#4) ship the proper fix for B-13 (mp1 wedge after mp2 completes early under cross-overlap). The fix uses the scheduler's existing gap-ratio machinery via pre-signalled durable demand instead of fighting asyncio.gather, and is materially better than my proposed M11.3 fire-shutdown_hard-early sketch. Root cause (refined from TianyeGGBond's PR description, Codex review of the rlix planner code confirms): - Between rollouts, actor_train preempts shared infer GPUs → both pipelines' actor_infer DP workers shrink to set(). - begin_progress_batch fires INSIDE the rollout fn, after dispatch starts → first pipeline wins all DP workers, peer hangs. - Two compounding sub-bugs: (a) pending_bucket_gen is a per-cycle snapshot that vanishes once consumed (planner blind during rollout boundary); (b) MilesRLixHooks was never wired at init → begin_progress_batch hit NoOpRLixHooks (silent no-op). Joint fix: - rlix#16: durable rollout_open_pipelines registry in scheduler; MilesPipeline.signal_rollout_demand(rollout_id, step_target); wires MilesRLixHooks in Phase B init; pending_bucket_gen durability. - miles#4: _signal_demand step hook in dual driver; set_rlix_hooks + _rlix_hooks plumbing on RolloutManager; call_rollout_fn forwards rlix_hooks; max_concurrency=4 on coordinator; router 499 + counter balance fix. Codex joint verdict: APPROVE_WITH_NOTES, 0 BLOCKERS. Prior CRITICAL (signal_rollout_demand missing) + HIGH (set_rlix_hooks not wired) both RESOLVED in the pair. Joint smoke evidence (per miles#4 PR): run_4ppl_full_overlap_n10.sh --num-rollout 5, run 28 EXIT=0 — earlier baseline hung on rollout 2. Merge order required: rlix#16 MUST land before miles#4 (miles#4 alone raises AttributeError on pipe.signal_rollout_demand.remote without the rlix#16 method). M11.3 follow-up scope shrinks: concurrent-resize stress test under max_concurrency=4; automated 4-GPU 2-pipeline rollout-2+ regression; verify generate_rollout_fully_async accepts rlix_hooks kw.
zhenyulincs
added a commit
that referenced
this pull request
May 24, 2026
Both PRs merged into the zhenyu/* branches; my Phase 1/3 work fast-forwarded cleanly under both merges (no conflicts because TianyeGGBond branched off my prior commits). Verified merged state contains both: - My Phase 3c finish_init_offload + Phase 1 release_train_only on rlix - TianyeGGBond's signal_rollout_demand + set_rlix_hooks wiring on rlix - My Phase 2 overlap env-driven topology on miles - TianyeGGBond's _signal_demand step hook + max_concurrency=4 on miles Vast 4×A40 dual-overlap regression smoke pending on a new instance — to be run with the joint fix to confirm all 7 PASS-bar conditions (including C2 shutdown_hard complete, previously blocked by B-13).
zhenyulincs
added a commit
that referenced
this pull request
May 24, 2026
New: docs/m11-tianye-prs-review.md — standalone document summarizing TianyeGGBond's paired PRs (#16 + rlops/miles#4) that fixed B-13 (M11.2 overlap rollout-2+ hang). Covers: - Root cause analysis (per-cycle pending_bucket_gen + MilesRLixHooks never wired) - Fix architecture (durable rollout_open_pipelines registry + signal_rollout_demand pre-dispatch) - File-level change inventory across both PRs (~14 files / +300 LoC) - Validation layers: 5 Codex review rounds, local merge spot-check, real vast smoke v8/v9/v10 evidence - Outstanding follow-ups (max_concurrency=4 stress test, rlix_hooks kw verification) - Doc + commit pointers Easier to share than the full m11-2-overlap-log.md for reviewers who only need the Tianye-PR context.
zhenyulincs
added a commit
that referenced
this pull request
May 24, 2026
Adds §3.5 "Post-Option-A milestones" tracking 6 new lines of work: - Phase 1 — R04-F1 try/finally + release_train_only shim (rlix 47bf02b) - Phase 3 — Option β state machine (finish_init_offload + Codex Q5 invariants) (rlix 47bf02b) - Phase 7 — F3+F4 driver-side cleanup (asyncio.wait FIRST_EXCEPTION + try/finally shutdown_hard) (miles 79f2874) - Tianye's paired PRs (#16 + rlops/miles#4) fix B-13 wedge — durable rollout_open_pipelines registry + signal_rollout_demand + MilesRLixHooks wiring; merged 2026-05-24 - B-14 hardware compat — MILES_SKIP_TMS_PAUSE=1 was unconditionally set; removing it on stable-tms hardware enables tms.pause() (rlix a011bbf) - Batch B-15+F5+F9+F7 — harness C0 + nvidia-smi INFO + odd-GPU validation + pause_generation docs (rlix a4c6369 + miles 1487c3f) Updates §1.4 status table — M11.2 now has 3 rows: - Option A (disjoint pools, M11.2 Attempt 4 GREEN) - Real overlap (shared infer pool, verified 2026-05-24 v8/v9/v10) - M11.3+ deferred Updates §6 known limitations — marks F1/F3/F4/B-13/B-14/B-15/F5/F7/F9 as RESOLVED with commit pointers. Adds rows for MED1 (concurrent resize stress) + LOW1 (rlix_hooks kw) follow-ups. F2 noted as howard989's PR in flight. Updates Appendix C — current HEADs (rlix 9b1fa90, miles 1487c3f). Cross-refs to plans/m11-2-overlap-log.md + docs/m11-tianye-prs-review.md.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Between rollouts,
actor_trainpreempts shared infer GPUs, causing both pipelines'actor_inferDP workers to shrink toset().The gap-ratio planner needs a fresh demand signal to re-wake engines, but that signal only arrived from
begin_progress_batchinside the rollout function — after sample dispatch had already started.Race result: whichever pipeline fires
begin_progress_batchfirst wins all DP workers; the other pipeline hangs.Two sub-bugs compounded it:
pending_bucket_genis a per-cycle snapshot that vanishes once consumed, so the planner sees no demand during the rollout boundary gap.MilesRLixHookswas never wired at init, so everybegin_progress_batchhitNoOpRLixHooks; the scheduler never received rollout demand at all.sequenceDiagram participant P1 as Pipeline 1 participant P2 as Pipeline 2 participant Sched as Scheduler (gap-ratio) Note over P1,Sched: _after_training: actor_train preempts infer GPUs Note over Sched: active_dp_ranks = ∅ for both<br/>pending_bucket_gen consumed → planner skips both P1->>Sched: begin_progress_batch() ← first to fire Sched->>Sched: wakes P1 engines, allocates all DP workers ✓ P2->>Sched: begin_progress_batch() ← too late Sched-->>P2: ❌ P1 holds all DP workers Note over P2: "No progress for 30.0s. Queue: 0, Collected: 0/N"Fix
Pre-signal scheduler demand before the rollout function starts, using
request_gpusinstead ofreport_progress.This makes the call write to the new durable
rollout_open_pipelinesregistry and block until at least one DP worker is awake.sequenceDiagram participant P1 as Pipeline 1 participant P2 as Pipeline 2 participant Sched as Scheduler Note over P1,Sched: _after_training: actor_train preempts infer GPUs Note over Sched: active_dp_ranks = ∅ for both P1->>Sched: signal_rollout_demand() → request_gpus(GENERATION, step_target=N) P2->>Sched: signal_rollout_demand() → request_gpus(GENERATION, step_target=N) Note over Sched: rollout_open_pipelines = {P1: N, P2: N} ← durable<br/>gap-ratio sees both → allocates DP workers for both Sched-->>P1: unblocks ✓ (engines awake) Sched-->>P2: unblocks ✓ (engines awake) P1->>P1: rollout fn dispatches first sample ✓ P2->>P2: rollout fn dispatches first sample ✓Changes
dfd53f3pending_bucket_genwith durablerollout_open_pipelines(pipeline_id → step_target) inSchedulerState; planner reads it for the full rollout lifecycle.f2ac106MilesRLixHooksin_init_phase_b_infer; add no-restart tombstones toOrchestrator/SchedulerActor.173dbcdMilesPipeline.signal_rollout_demand()to pre-signal demand before rollout dispatch.8436e95signal_rollout_demandto callrequest_gpusinstead ofreport_progress.e6276b9Orchestratordependency in scheduler, plus chaos test where orchestrator kill leaves scheduler reachable.Requires companion
milesPRmiles@tianye/4gpu-2ppl-fixThis companion PR wires:
rlix_train_loop.signal_demandhookrun_miles_dualcall-siteTest
Unit tests
test_orchestrator_death_is_benign.pytest_gap_ratio.pytest_scheduler_apply_plan_invariants.pyE2E test
Configuration:
P1 train = [0, 1]P1 infer = [0, 1, 2, 3]P2 train = [2, 3]P2 infer = [0, 1, 2, 3]Result:
EXIT=0✓