Skip to content

Tianye/4gpu 2ppl fix on open pipelines#16

Merged
zhenyulincs merged 5 commits into
rlops:zhenyu/miles-mvp-e2efrom
TianyeGGBond:tianye/4gpu-2ppl-fix-on-open-pipelines
May 24, 2026
Merged

Tianye/4gpu 2ppl fix on open pipelines#16
zhenyulincs merged 5 commits into
rlops:zhenyu/miles-mvp-e2efrom
TianyeGGBond:tianye/4gpu-2ppl-fix-on-open-pipelines

Conversation

@TianyeGGBond
Copy link
Copy Markdown
Collaborator

Between rollouts, actor_train preempts shared infer GPUs, causing both pipelines' actor_infer DP workers to shrink to set().

The gap-ratio planner needs a fresh demand signal to re-wake engines, but that signal only arrived from begin_progress_batch inside the rollout function — after sample dispatch had already started.

Race result: whichever pipeline fires begin_progress_batch first wins all DP workers; the other pipeline hangs.

Two sub-bugs compounded it:

  • pending_bucket_gen is a per-cycle snapshot that vanishes once consumed, so the planner sees no demand during the rollout boundary gap.
  • MilesRLixHooks was never wired at init, so every begin_progress_batch hit NoOpRLixHooks; the scheduler never received rollout demand at all.
sequenceDiagram
    participant P1 as Pipeline 1
    participant P2 as Pipeline 2
    participant Sched as Scheduler (gap-ratio)

    Note over P1,Sched: _after_training: actor_train preempts infer GPUs
    Note over Sched: active_dp_ranks = ∅ for both<br/>pending_bucket_gen consumed → planner skips both

    P1->>Sched: begin_progress_batch()  ← first to fire
    Sched->>Sched: wakes P1 engines, allocates all DP workers ✓

    P2->>Sched: begin_progress_batch()  ← too late
    Sched-->>P2: ❌ P1 holds all DP workers
    Note over P2: "No progress for 30.0s. Queue: 0, Collected: 0/N"
Loading

Fix

Pre-signal scheduler demand before the rollout function starts, using request_gpus instead of report_progress.

This makes the call write to the new durable rollout_open_pipelines registry and block until at least one DP worker is awake.

sequenceDiagram
    participant P1 as Pipeline 1
    participant P2 as Pipeline 2
    participant Sched as Scheduler

    Note over P1,Sched: _after_training: actor_train preempts infer GPUs
    Note over Sched: active_dp_ranks = ∅ for both

    P1->>Sched: signal_rollout_demand() → request_gpus(GENERATION, step_target=N)
    P2->>Sched: signal_rollout_demand() → request_gpus(GENERATION, step_target=N)
    Note over Sched: rollout_open_pipelines = {P1: N, P2: N} ← durable<br/>gap-ratio sees both → allocates DP workers for both

    Sched-->>P1: unblocks ✓ (engines awake)
    Sched-->>P2: unblocks ✓ (engines awake)

    P1->>P1: rollout fn dispatches first sample ✓
    P2->>P2: rollout fn dispatches first sample ✓
Loading

Changes

Commit What
dfd53f3 Replace pending_bucket_gen with durable rollout_open_pipelines (pipeline_id → step_target) in SchedulerState; planner reads it for the full rollout lifecycle.
f2ac106 Wire MilesRLixHooks in _init_phase_b_infer; add no-restart tombstones to Orchestrator / SchedulerActor.
173dbcd Add MilesPipeline.signal_rollout_demand() to pre-signal demand before rollout dispatch.
8436e95 Fix signal_rollout_demand to call request_gpus instead of report_progress.
e6276b9 Add tests: static guard for no runtime Orchestrator dependency in scheduler, plus chaos test where orchestrator kill leaves scheduler reachable.

Requires companion miles PR

miles@tianye/4gpu-2ppl-fix

This companion PR wires:

  • rlix_train_loop.signal_demand hook
  • run_miles_dual call-site

Test

Unit tests

  • test_orchestrator_death_is_benign.py
  • test_gap_ratio.py
  • test_scheduler_apply_plan_invariants.py

E2E test

run_4ppl_full_overlap_n10.sh --num-rollout 5

Configuration:

  • P1 train = [0, 1]
  • P1 infer = [0, 1, 2, 3]
  • P2 train = [2, 3]
  • P2 infer = [0, 1, 2, 3]

Result:

  • Run 28: all 5 rollouts clean, EXIT=0

TianyeGGBond and others added 5 commits May 19, 2026 22:19
…lout_open_pipelines

pending_bucket_gen is a per-cycle snapshot that disappears once the
GENERATION request is signalled.  Between signal and the first
ProgressReport from begin_progress_batch(), the planner had no demand
signal for the pipeline and could skip it entirely — stranding it
without GPU workers.

Introduce rollout_open_pipelines (pipeline_id -> step_target_estimate)
in SchedulerState as a durable replacement:
- Set when request_gpus(priority=GENERATION) is enqueued
- Cleared when the GENERATION cluster is released or clear_progress()
  retires all coordinator batch streams

planner.plan_generation_gap_ratio() now accepts rollout_open_pipelines
instead of pending_bucket_gen.  The two helper functions
has_pending_generation_request / get_pending_generation_step_target_estimate
are removed; their logic is inlined against the dict directly.

The bootstrap step_target path (step_target <= 0) and demand inflation
both key off rollout_open_pipelines, which covers the full rollout
lifecycle rather than just the brief window before the pending request
is consumed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tor no-restart tombstones

On top of dfd53f3's rollout_open_pipelines fix, this commit lands the
two pieces of the prior 4-GPU 2-pipeline iteration that are NOT
superseded by the new durable-open-registry scheme:

1) Phase B step 4c — MilesPipeline now calls
   rollout_manager.set_rlix_hooks.remote(MilesRLixHooks(
       coordinator_handle=self._coordinator_handle,
       pipeline_id=self._pipeline_id,
   ))
   inside _init_phase_b_infer (line ~372). Without this, every
   begin_progress_batch / bump_completed from the rollout function
   lands on NoOpRLixHooks and the scheduler never sees rollout
   demand — gap-ratio then has no signal to wake engines for
   rollout N+1 after _after_training released actor_train. Requires
   the matching set_rlix_hooks plumbing in
   TianyeGGBond/miles@tianye/4gpu-2ppl-fix.

2) MilesRLixHooks observability — begin_progress_batch now emits
   one INFO line, _fire_report adds a DEBUG diagnostic, and the
   missing-coordinator-method branch is upgraded from DEBUG to
   WARNING ("progress signal lost"). Lets future hangs be traced
   without a debugger.

Plus tombstone comments on Orchestrator/SchedulerActor max_restarts=0,
max_task_retries=0 documenting WHY restart is unsafe (in-memory state
on _topology_ready / active_allocations / pipeline_registry would be
wiped on a Ray-driven restart, leaving the actor stuck waiting on a
re-set event). Documentation-only; the values were already 0/0.

The obsolete pieces from prior iteration commit 1c9755a (the
step_target_estimate-on-ClusterAllocation Layer-1 fallback,
gen_step_target_estimates hoisting, ClusterAllocation field bump,
and the 354-line tests/test_gap_ratio.py addition) are dropped here
because dfd53f3 makes them redundant — rollout_open_pipelines
already gives the planner durable demand-window visibility without
needing a fallback to the persisted estimate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-signal scheduler demand BEFORE each rollout dispatch so the gap-ratio
planner can wake actor_infer engines without waiting for the rollout
function's first begin_progress_batch — too late under cross-pipeline
contention, where the first pipeline to fire begin_progress_batch wins
the GENERATION budget and the peer pipeline starves on
"Warning: No progress for 30.0s. Queue size: 0, Collected: 0/N".

Root cause: between rollouts, both pipelines' actor_infer DP workers
shrink to set() because actor_train (ACTOR_TRAINING priority) preempts
the shared infer GPUs. The existing gap-ratio fallback (alloc.step_target_estimate
+ v3 peer-signal rule) keeps both pipelines visible to the planner but
does NOT proactively expand engines until a fresh demand signal lands —
and that signal previously arrived inside the rollout function (via
MilesRLixHooks.begin_progress_batch), creating a chicken-and-egg where
the function needs engines awake to dispatch its first sample but the
scheduler needs the first sample to know the function needs engines.

Mirrors rlix/pipeline/full_finetune_pipeline.py:696
(_notify_release_then_request_cluster_gpus) which re-requests actor_infer
at GENERATION with a fresh step_target_estimate per rollout. miles cannot
release+re-request actor_infer between rollouts without tearing down
SGLang engines, so this fix instead publishes a synthetic new_batch=True
progress report shaped exactly like MilesCoordinator._aggregate_and_emit
output — the scheduler "latest wins" overwrites it cleanly when the
real begin_progress_batch lands.

Changes:
- MilesPipeline.signal_rollout_demand(rollout_id, step_target):
  posts ProgressReport(mode="aggregated", new_batch=True, completed=0,
  step_target_trajectories=step_target) to scheduler.report_progress.
  Best-effort: try/except logs warning and returns so a transient RPC
  failure cannot block the rollout (in-rollout begin_progress_batch
  remains the correctness baseline).

- Orchestrator + Scheduler max_concurrency=4 (was default 1) — mitigates
  the Ray-2.55.1 worker.py:1039 "'Worker' object has no attribute
  core_worker" race that fires more readily under the added per-rollout
  signal_demand traffic. Hot-path per-rollout RPC bypasses the
  orchestrator (signal_demand goes pipeline -> scheduler directly), and
  the scheduler's state mutations remain under self._lock = asyncio.Lock,
  so the bump is data-hazard-free.

Test: run 28 of /root/run_4ppl_full_overlap_n10.sh
  (--num-rollout 5, P1 train=[0,1] infer=[0,1,2,3], P2 train=[2,3]
  infer=[0,1,2,3]) — completed all 5 rollouts cleanly with EXIT=0;
  prior runs (run 25 baseline, run 27 with concurrency-bump only)
  hung on rollout 2 collection with "No progress for 30.0s".

Companion miles changes (rlix_train_loop.signal_demand hook +
run_miles_dual driver wire-up) ship on miles@tianye/4gpu-2ppl-fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On the dfd53f3 base, scheduler.report_progress does NOT write to
SchedulerState.rollout_open_pipelines — that registry is only
populated by request_gpus (line 638) and cleared by clear_progress
/ notify_release_gpus. The planner's _receiver_eligible and the
remaining/step_target derivation read FROM this registry.

The previous signal_rollout_demand posted a synthetic ProgressReport
which was a no-op against this contract: the report landed in
latest_progress_by_pipeline but never marked the pipeline as "open"
in rollout_open_pipelines. After the prior rollout's
clear_progress had popped the registry entry, the planner skipped
the pipeline entirely (continue) and engines never woke. Empirical
repro: run 29 (smoke n=3 on rebased branch) hung on rollout 2 with
"No progress for 30.0s. Queue size: 0, Collected: 0/4" for both
pipelines simultaneously.

The fix swaps to scheduler.request_gpus(actor_infer, GENERATION,
step_target_estimate=N). request_gpus's existing-allocation path
at scheduler.py:614-625 gives us the right semantics for free:

- alloc.active_dp_ranks non-empty (mid-rollout) — short-circuits
  and returns the existing GPU list. Cheap no-op.
- alloc.active_dp_ranks == set() (rollout boundary) — enqueues a
  pending request, writes rollout_open_pipelines[pipeline_id] =
  step_target_estimate, fires the wakeup, BLOCKS until the
  planner grants at least one DP worker.

The blocking semantics on the boundary case give us the
synchronization the synthetic ProgressReport could never provide:
when signal_rollout_demand returns, the rollout function is
guaranteed at least one awake engine for its first sample
dispatch.

Also removes the now-unused ProgressReport import and updates
docs/internal/4gpu-2ppl-rollout2-hang-fix.md to describe the
request_gpus contract instead of the synthetic-ProgressReport
contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… test)

Adds two-tier coverage for the Ray-2.55.1 worker.py:1039 race that
max_concurrency=4 only mitigates:

1) test_no_runtime_orchestrator_dependency_in_scheduler — static
   guard. Asserts SchedulerImpl's runtime surface (request_gpus,
   report_progress, notify_release_gpus, clear_progress,
   notify_release_then_request_gpus) does NOT touch the Orchestrator
   at runtime. If a future refactor re-introduces a runtime
   dependency, this fires in CI before it can hit a real training
   run. Always runs (no Ray cluster required).

2) test_scheduler_reachable_after_orchestrator_kill — chaos test.
   Brings up the rlix control plane via client.connect, kills the
   Orchestrator with ray.kill, then verifies the SchedulerActor
   handle stays resolvable via ray.get_actor. Skipped when Ray is
   unavailable.

Updates docs/internal/4gpu-2ppl-rollout2-hang-fix.md with explicit
mitigation-vs-fix analysis: the race itself is upstream Ray; we
cannot patch worker.py:1039 cleanly. But it is architecturally
harmless because (a) all orchestrator callers are driver-startup
or fatal-error-only, (b) the fatal-error path already swallows
ActorDiedError, and (c) the static guard test prevents
regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
zhenyulincs added a commit that referenced this pull request May 24, 2026
TianyeGGBond's paired PRs (#16 + rlops/miles#4) ship the
proper fix for B-13 (mp1 wedge after mp2 completes early under
cross-overlap). The fix uses the scheduler's existing gap-ratio
machinery via pre-signalled durable demand instead of fighting
asyncio.gather, and is materially better than my proposed M11.3
fire-shutdown_hard-early sketch.

Root cause (refined from TianyeGGBond's PR description, Codex review
of the rlix planner code confirms):
- Between rollouts, actor_train preempts shared infer GPUs → both
  pipelines' actor_infer DP workers shrink to set().
- begin_progress_batch fires INSIDE the rollout fn, after dispatch
  starts → first pipeline wins all DP workers, peer hangs.
- Two compounding sub-bugs: (a) pending_bucket_gen is a per-cycle
  snapshot that vanishes once consumed (planner blind during rollout
  boundary); (b) MilesRLixHooks was never wired at init →
  begin_progress_batch hit NoOpRLixHooks (silent no-op).

Joint fix:
- rlix#16: durable rollout_open_pipelines registry in scheduler;
  MilesPipeline.signal_rollout_demand(rollout_id, step_target);
  wires MilesRLixHooks in Phase B init; pending_bucket_gen durability.
- miles#4: _signal_demand step hook in dual driver; set_rlix_hooks +
  _rlix_hooks plumbing on RolloutManager; call_rollout_fn forwards
  rlix_hooks; max_concurrency=4 on coordinator; router 499 + counter
  balance fix.

Codex joint verdict: APPROVE_WITH_NOTES, 0 BLOCKERS. Prior CRITICAL
(signal_rollout_demand missing) + HIGH (set_rlix_hooks not wired)
both RESOLVED in the pair.

Joint smoke evidence (per miles#4 PR): run_4ppl_full_overlap_n10.sh
--num-rollout 5, run 28 EXIT=0 — earlier baseline hung on rollout 2.

Merge order required: rlix#16 MUST land before miles#4 (miles#4 alone
raises AttributeError on pipe.signal_rollout_demand.remote without
the rlix#16 method).

M11.3 follow-up scope shrinks: concurrent-resize stress test under
max_concurrency=4; automated 4-GPU 2-pipeline rollout-2+ regression;
verify generate_rollout_fully_async accepts rlix_hooks kw.
@zhenyulincs zhenyulincs merged commit 80583ee into rlops:zhenyu/miles-mvp-e2e May 24, 2026
zhenyulincs added a commit that referenced this pull request May 24, 2026
Both PRs merged into the zhenyu/* branches; my Phase 1/3 work
fast-forwarded cleanly under both merges (no conflicts because
TianyeGGBond branched off my prior commits).

Verified merged state contains both:
- My Phase 3c finish_init_offload + Phase 1 release_train_only on rlix
- TianyeGGBond's signal_rollout_demand + set_rlix_hooks wiring on rlix
- My Phase 2 overlap env-driven topology on miles
- TianyeGGBond's _signal_demand step hook + max_concurrency=4 on miles

Vast 4×A40 dual-overlap regression smoke pending on a new instance —
to be run with the joint fix to confirm all 7 PASS-bar conditions
(including C2 shutdown_hard complete, previously blocked by B-13).
zhenyulincs added a commit that referenced this pull request May 24, 2026
New: docs/m11-tianye-prs-review.md — standalone document summarizing
TianyeGGBond's paired PRs (#16 + rlops/miles#4) that fixed
B-13 (M11.2 overlap rollout-2+ hang).

Covers:
- Root cause analysis (per-cycle pending_bucket_gen + MilesRLixHooks
  never wired)
- Fix architecture (durable rollout_open_pipelines registry +
  signal_rollout_demand pre-dispatch)
- File-level change inventory across both PRs (~14 files / +300 LoC)
- Validation layers: 5 Codex review rounds, local merge spot-check,
  real vast smoke v8/v9/v10 evidence
- Outstanding follow-ups (max_concurrency=4 stress test, rlix_hooks
  kw verification)
- Doc + commit pointers

Easier to share than the full m11-2-overlap-log.md for reviewers
who only need the Tianye-PR context.
zhenyulincs added a commit that referenced this pull request May 24, 2026
Adds §3.5 "Post-Option-A milestones" tracking 6 new lines of work:

- Phase 1 — R04-F1 try/finally + release_train_only shim (rlix 47bf02b)
- Phase 3 — Option β state machine (finish_init_offload + Codex Q5
  invariants) (rlix 47bf02b)
- Phase 7 — F3+F4 driver-side cleanup (asyncio.wait FIRST_EXCEPTION +
  try/finally shutdown_hard) (miles 79f2874)
- Tianye's paired PRs (#16 + rlops/miles#4) fix B-13 wedge —
  durable rollout_open_pipelines registry + signal_rollout_demand +
  MilesRLixHooks wiring; merged 2026-05-24
- B-14 hardware compat — MILES_SKIP_TMS_PAUSE=1 was unconditionally set;
  removing it on stable-tms hardware enables tms.pause() (rlix a011bbf)
- Batch B-15+F5+F9+F7 — harness C0 + nvidia-smi INFO + odd-GPU
  validation + pause_generation docs (rlix a4c6369 + miles 1487c3f)

Updates §1.4 status table — M11.2 now has 3 rows:
- Option A (disjoint pools, M11.2 Attempt 4 GREEN)
- Real overlap (shared infer pool, verified 2026-05-24 v8/v9/v10)
- M11.3+ deferred

Updates §6 known limitations — marks F1/F3/F4/B-13/B-14/B-15/F5/F7/F9
as RESOLVED with commit pointers. Adds rows for MED1 (concurrent
resize stress) + LOW1 (rlix_hooks kw) follow-ups. F2 noted as
howard989's PR in flight.

Updates Appendix C — current HEADs (rlix 9b1fa90, miles 1487c3f).

Cross-refs to plans/m11-2-overlap-log.md + docs/m11-tianye-prs-review.md.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants