fix(actor): use training_done event under Fast-LLM, not sample-counting heuristic #142
Open
RaymondLi0 wants to merge 1 commit into
Conversation
fix(actor): use training_done event under Fast-LLM, not sample-counting heuristic
The actor decides when training is done via `is_trainer_finished()`. On the
legacy HF/DeepSpeed trainer path, the check is:
    samples_target = max_train_steps * train_batch_size * gradient_accumulation_passes
    samples_processed >= samples_target
This is correct for that path because `gradient_accumulation_passes` counts
the microbatches per optimizer step, so the product is samples per step.
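As a worked example with the repro values below: `train_batch_size=1` and
`gradient_accumulation_passes=1026` give 1,026 samples per optimizer step, so
`max_train_steps=400` yields samples_target = 400 × 1 × 1026 = 410,400.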
Under Fast-LLM (`use_fast_llm=True`), the trainer uses its own
`schedule.docs_per_step` knob instead, and `gradient_accumulation_passes`
becomes vestigial — it is not propagated to the Fast-LLM trainer. Worse,
Fast-LLM's `_prefetch_to_doc_target` always overshoots `docs_per_step` by a
few documents per step (the loop runs `while total_docs < target` and stops
just after crossing it). The accumulated overshoot, combined with the
unrelated `gradient_accumulation_passes` value (which auto-adjusts to be
divisible by the trainer-rank count, e.g. 1024 -> 1026 for finetune_fraction=6),
leaves `samples_target` thousands of documents short of the trainer's
true 400-step consumption.
Concrete reproduction on a single 8-GPU node with the math gspo recipe
(`max_train_steps=400`, `gradient_accumulation_passes=1026`, `docs_per_step=1024`,
`train_iters=400`): the actor declares completion at samples_processed=410,400
while the Fast-LLM trainer is only at step ~393/400 and stops feeding redis;
the trainer's `RedisStreamingDataset` times out after 600s with
`No document received`; vLLM logs `[FastLLM] training_finished was not received;
forcing stop`; and the job dies with `EngineDeadError`.
Fast-LLM already publishes an explicit `{"type": "training_finished"}` event to
the `fast_llm_events` redis stream at the natural end of training
(`fast_llm/engine/training/streaming.py:train_end`), and PipelineRL already
listens for it and sets `TrainerState.training_done = True`
(`pipelinerl/state.py:112-115`). This commit makes `is_trainer_finished()` —
and its inline mirror in `ActorLoop.run()` at line ~614 — consult that flag
when `use_fast_llm=True`, while preserving the sample-counting heuristic for
the HF/DeepSpeed path. No race, no overshoot accounting, no implicit dependence
on `gradient_accumulation_passes`.
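A minimal sketch of the resulting check, assuming the config flag and
TrainerState fields named above (the committed code in `pipelinerl/actor.py`
may differ in detail):

```python
def is_trainer_finished(cfg, trainer_state) -> bool:
    if cfg.use_fast_llm:
        # Fast-LLM path: rely on the explicit training_finished event,
        # which the events listener relays into TrainerState.training_done.
        return trainer_state.training_done
    # HF/DeepSpeed path: keep the sample-counting heuristic.
    samples_target = (
        cfg.max_train_steps
        * cfg.train_batch_size
        * cfg.gradient_accumulation_passes
    )
    return trainer_state.samples_processed >= samples_target
```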
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
`pipelinerl.actor.is_trainer_finished()` decides when the rollout actor should stop feeding the redis data stream. On the legacy HF/DeepSpeed trainer path the implementation is correct, but on the Fast-LLM path it fires several optimizer steps early; the trainer then times out waiting for documents, and the job dies.

Symptom (single 8-GPU node, math gspo recipe, `max_train_steps=400`, `docs_per_step=1024`, `train_iters=400`, `gradient_accumulation_passes=1024` auto-adjusted to 1026 to be divisible by the 6 trainer ranks): the actor stops feeding redis with the trainer at step ~393/400, the trainer's `RedisStreamingDataset` times out after 600s, and the job dies with `EngineDeadError`.

Root cause
The current check is the same for both trainer paths (`pipelinerl/actor.py:158`, `170-174`).
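A minimal reconstruction of that check (names follow the commit message; the exact code at `pipelinerl/actor.py:170-174` may differ):

```python
# Sample-counting heuristic, currently applied on both trainer paths:
samples_target = (
    cfg.max_train_steps
    * cfg.train_batch_size
    * cfg.gradient_accumulation_passes
)
finished = trainer_state.samples_processed >= samples_target
```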
This formula is correct for the HF/DeepSpeed path: `gradient_accumulation_passes` counts microbatches per optimizer step, and `train_batch_size × gradient_accumulation_passes` = samples per step.

Under Fast-LLM (`use_fast_llm=True`), `gradient_accumulation_passes` is vestigial: it is not propagated to the Fast-LLM trainer, which has its own `schedule.docs_per_step` configuration. Fast-LLM's `_prefetch_to_doc_target` (`fast_llm/engine/training/trainer.py`) also always overshoots `docs_per_step` by a few documents per step.
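A sketch of the loop shape (illustrative only; `read_next_group` is a hypothetical stand-in, see `fast_llm/engine/training/trainer.py` for the real code):

```python
total_docs = 0
while total_docs < docs_per_step:
    group = read_next_group()  # one group spans several documents
    total_docs += len(group)
# The loop exits just AFTER crossing docs_per_step, so e.g. a
# 1024-doc target can end at ~1043-1044 docs depending on group sizes.
```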
The compound effect: with `docs_per_step=1024`, real per-step consumption is ~1043–1044 docs, so 400 trainer steps consume ~417k docs in total. But the actor's `samples_target` is computed as 400 × 1 × 1026 = 410,400. The actor declares completion when the trainer is around step ~393/400, stops feeding redis, and the trainer eventually hits its 600s `RedisStreamingDataset` timeout.
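A back-of-the-envelope check of the gap (the 1043.5 docs/step figure is approximated from the observed ~1043–1044 range):

```python
per_step_actual = 1043.5        # observed real consumption per step
steps = 400

trainer_needs = steps * per_step_actual         # ~417,400 docs over 400 steps
actor_target = steps * 1 * 1026                 # 410,400 = samples_target
steps_covered = actor_target / per_step_actual  # ~393.3 -> actor quits near step 393
```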
The colleague's multinode submit script (`submit_eai_math_7b_multinode.sh`) has the same latent bug: `gradient_accumulation_passes=1024` (no divisibility adjustment, because finetune_fraction=4 × 4 nodes = 16 ranks divides 1024 evenly), so the actor stops at trainer step ~392 instead of 400. The truncation is small enough that it has gone unnoticed.

Fix
Fast-LLM already publishes an explicit `{"type": "training_finished"}` event to the `fast_llm_events` redis stream at the natural end of training (`fast_llm/engine/training/streaming.py:train_end`), and `pipelinerl/state.py:112-115` already listens for it and sets `TrainerState.training_done = True`. The vLLM workers and `TrainerState` consume the same signal (`pipelinerl/vllm1.py:534-547`, `pipelinerl/state.py:112-115`).

This PR makes the actor's completion check route to that signal under Fast-LLM, falling back to the sample-counting heuristic for the HF/DeepSpeed path. Two call sites are updated symmetrically:
- `pipelinerl/actor.py:170-174` (the `is_trainer_finished` function used by `schedule_rollouts`).
- `pipelinerl/actor.py:612-616` (the inline mirror inside `ActorLoop.run`).

The HF/DeepSpeed path is unchanged. No new config keys, no race, no overshoot accounting, no implicit dependence on `gradient_accumulation_passes`.

Test plan
- Repro before the fix: `max_train_steps=400`, single 8-GPU node; observe `is_trainer_finished()` returning True at samples_processed=410,400 with the trainer at step 393.
- With the fix, the run continues until Fast-LLM publishes `training_finished`; the actor sees `trainer_state.training_done=True` and shuts down cleanly without `EngineDeadError`.
- `use_fast_llm=false` is unchanged: the `samples_target` calculation and check are unmodified for that branch.

🤖 Generated with Claude Code