
fix(actor): use training_done event under Fast-LLM, not sample-counting heuristic#142

Open
RaymondLi0 wants to merge 1 commit into fast-llm from fix/actor-finished-uses-training-done-for-fast-llm

Conversation

@RaymondLi0
Collaborator

Summary

pipelinerl.actor.is_trainer_finished() decides when the rollout actor should stop feeding the redis data stream. On the legacy HF/DeepSpeed trainer path the check is correct, but on the Fast-LLM path it fires several optimizer steps too early: the actor stops feeding documents, the trainer times out waiting for them, and the job dies.

Symptom (single 8-GPU node, math gspo recipe, max_train_steps=400, docs_per_step=1024, train_iters=400, gradient_accumulation_passes=1024 auto-adjusted to 1026 to be divisible by the 6 trainer ranks):

[finetune rank 0] training @ step 393/400 | consumed tokens: 452,736,000 | ...
[actor]: ... is_trainer_finished() returns True at samples_processed=410,400
[actor]: train scheduler for llms 0,1: rollout task encountered ServerDisconnectedError
         after trainer completion; ignoring
[actor]: Rollout maker loop closed
... (10 minutes later, no new docs in redis) ...
fast_llm/data/dataset/streaming.py:175: TimeoutError: No document received after 600 seconds
vLLM stderr: [FastLLM] training_finished was not received; forcing stop
              EngineDeadError

Root cause

The current check is the same for both trainer paths (pipelinerl/actor.py:158, 170-174):

samples_target = final_steps * cfg.finetune.train_batch_size * cfg.finetune.gradient_accumulation_passes

def is_trainer_finished() -> bool:
    return (
        trainer_state.samples_processed is not None
        and trainer_state.samples_processed >= samples_target
    )

This formula is correct for the HF/DeepSpeed path: gradient_accumulation_passes counts microbatches per optimizer step, and train_batch_size × gradient_accumulation_passes = samples per step.

Under Fast-LLM (use_fast_llm=True), gradient_accumulation_passes is vestigial — it is not propagated to the Fast-LLM trainer, which has its own schedule.docs_per_step configuration. Fast-LLM's _prefetch_to_doc_target (fast_llm/engine/training/trainer.py) also always overshoots docs_per_step by a few documents per step:

target = self._config.schedule.docs_per_step
total_docs = 0
while total_docs < target:
    mb = next(data_iterator)
    total_docs += mb[0].num_documents_in_batch  # variable per microbatch

The compound effect: with docs_per_step=1024, real per-step consumption is ~1043–1044 docs, so 400 trainer steps consume ~417k docs total. But the actor's samples_target is computed as 400 × 1 × 1026 = 410,400. The actor declares completion when the trainer is around step ~393/400, stops feeding redis, and the trainer eventually hits its 600s RedisStreamingDataset timeout.
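
To make the gap concrete, a back-of-the-envelope check (illustrative only: the ~1043.5 docs/step average is taken from the observed run, and the real actor reads samples_processed from trainer-state updates rather than simulating steps):

import math

# Actor-side target, as computed today with the HF/DeepSpeed-style formula:
samples_target = 400 * 1 * 1026        # final_steps * train_batch_size * gradient_accumulation_passes = 410,400

# What the Fast-LLM trainer actually consumes per step, including prefetch overshoot:
docs_per_step_actual = 1043.5          # observed average for docs_per_step=1024

total_docs_400_steps = 400 * docs_per_step_actual                          # ~417,400 docs
step_when_actor_stops = math.ceil(samples_target / docs_per_step_actual)   # ~394 in this rough model; the run above stopped near step 393

print(total_docs_400_steps, step_when_actor_stops)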

The colleague's multinode submit script (submit_eai_math_7b_multinode.sh) carries the same bug in latent form — gradient_accumulation_passes stays at 1024 (no divisibility adjustment, since finetune_fraction=4 across 4 nodes gives 16 trainer ranks, which divides 1024 evenly), so the actor stops at trainer step ~392 instead of 400. The truncation is small enough that it has gone unnoticed.

Fix

Fast-LLM already publishes an explicit {"type": "training_finished"} event to the fast_llm_events redis stream at the natural end of training (fast_llm/engine/training/streaming.py:train_end), and pipelinerl/state.py:112-115 already listens for it and sets TrainerState.training_done = True. The vLLM workers and TrainerState consume the same signal (pipelinerl/vllm1.py:534-547, pipelinerl/state.py:112-115).
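
For context, the shape of that signal path is roughly the sketch below (a minimal illustration using redis-py, not the actual pipelinerl/state.py code; the connection parameters and the name of the field holding the JSON payload are assumptions):

import json
import redis

r = redis.Redis()  # connection parameters are deployment-specific (assumption)
last_id = "0-0"
training_done = False

while not training_done:
    # Block up to 5 seconds waiting for new entries on the Fast-LLM event stream.
    for _stream, entries in r.xread({"fast_llm_events": last_id}, block=5000, count=16) or []:
        for entry_id, fields in entries:
            last_id = entry_id
            event = json.loads(fields[b"data"])  # "data" field name is an assumption
            if event.get("type") == "training_finished":
                training_done = True  # mirrors TrainerState.training_done = True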

This PR makes the actor's completion check route to that signal under Fast-LLM, falling back to the sample-counting heuristic for the HF/DeepSpeed path. Two call sites are updated symmetrically:

  • pipelinerl/actor.py:170-174 (is_trainer_finished function used by schedule_rollouts).
  • pipelinerl/actor.py:612-616 (inline mirror inside ActorLoop.run).

The HF/DeepSpeed path is unchanged. No new config keys, no race, no overshoot accounting, no implicit dependence on gradient_accumulation_passes.
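
Conceptually, the updated check looks like the sketch below (illustrative, not the literal diff; the exact attribute used to detect the Fast-LLM path, written here as cfg.finetune.use_fast_llm, is an assumption):

def is_trainer_finished() -> bool:
    if cfg.finetune.use_fast_llm:  # config path is an assumption; use_fast_llm is the real flag name
        # Fast-LLM path: trust the explicit training_finished event,
        # relayed into TrainerState by pipelinerl/state.py.
        return bool(trainer_state.training_done)
    # HF/DeepSpeed path: unchanged sample-counting heuristic.
    return (
        trainer_state.samples_processed is not None
        and trainer_state.samples_processed >= samples_target
    )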

Test plan

  • Reproduce the bug locally: math gspo run, max_train_steps=400, single 8-GPU node — observe is_trainer_finished() returning True at samples_processed=410,400 with the trainer at step 393.
  • After this PR: rerun same config, verify the trainer reaches step 400, emits training_finished, the actor sees trainer_state.training_done=True, and shuts down cleanly without EngineDeadError.
  • Spot-check that the HF/DeepSpeed path with use_fast_llm=false is unchanged: the samples_target calculation and check are unmodified for that branch.

🤖 Generated with Claude Code

fix(actor): use training_done event under Fast-LLM, not sample-counting heuristic

The actor decides when training is done via `is_trainer_finished()`. On the
legacy HF/DeepSpeed trainer path, the check is:

    samples_target = max_train_steps * train_batch_size * gradient_accumulation_passes
    samples_processed >= samples_target

This is correct for that path because `gradient_accumulation_passes` counts
the microbatches per optimizer step, so the product is samples per step.

Under Fast-LLM (`use_fast_llm=True`), the trainer uses its own
`schedule.docs_per_step` knob instead, and `gradient_accumulation_passes`
becomes vestigial — it is not propagated to the Fast-LLM trainer. Worse,
Fast-LLM's `_prefetch_to_doc_target` always overshoots `docs_per_step` by a
few documents per step (the loop runs `while total_docs < target` and stops
just after crossing it). The accumulated overshoot, combined with the
unrelated `gradient_accumulation_passes` value (which auto-adjusts to be
divisible by the trainer-rank count, e.g. 1024 -> 1026 for finetune_fraction=6),
makes `samples_target` fall thousands of documents short of the trainer's
true 400-step consumption.

Concrete reproduction on a single 8-GPU node with the math gspo recipe
(`max_train_steps=400`, `gradient_accumulation_passes=1026`, `docs_per_step=1024`,
`train_iters=400`): actor declares completion at samples_processed=410,400 while
the Fast-LLM trainer is only at step ~393/400, stops feeding redis, the
trainer's `RedisStreamingDataset` times out after 600s with
`No document received`, vLLM logs `[FastLLM] training_finished was not received;
forcing stop`, and the job dies with `EngineDeadError`.

Fast-LLM already publishes an explicit `{"type": "training_finished"}` event to
the `fast_llm_events` redis stream at the natural end of training
(`fast_llm/engine/training/streaming.py:train_end`), and PipelineRL already
listens for it and sets `TrainerState.training_done = True`
(`pipelinerl/state.py:112-115`). This commit makes `is_trainer_finished()` —
and its inline mirror in `ActorLoop.run()` at line ~614 — consult that flag
when `use_fast_llm=True`, while preserving the sample-counting heuristic for
the HF/DeepSpeed path. No race, no overshoot accounting, no implicit dependence
on `gradient_accumulation_passes`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
