
fix(actor): use training_done event under Fast-LLM, not sample-counting heuristic#142

Open
RaymondLi0 wants to merge 1 commit into fast-llm from fix/actor-finished-uses-training-done-for-fast-llm

Conversation

@RaymondLi0
Collaborator

Summary

pipelinerl.actor.is_trainer_finished() decides when the rollout actor should stop feeding the redis data stream. On the legacy HF/DeepSpeed trainer path the check is correct, but on the Fast-LLM path it fires several optimizer steps too early: the actor stops feeding documents, the trainer times out waiting for them, and the job dies.

Symptom (single 8-GPU node, math gspo recipe, max_train_steps=400, docs_per_step=1024, train_iters=400, gradient_accumulation_passes=1024 auto-adjusted to 1026 to be divisible by the 6 trainer ranks):

[finetune rank 0] training @ step 393/400 | consumed tokens: 452,736,000 | ...
[actor]: ... is_trainer_finished() returns True at samples_processed=410,400
[actor]: train scheduler for llms 0,1: rollout task encountered ServerDisconnectedError
         after trainer completion; ignoring
[actor]: Rollout maker loop closed
... (10 minutes later, no new docs in redis) ...
fast_llm/data/dataset/streaming.py:175: TimeoutError: No document received after 600 seconds
vLLM stderr: [FastLLM] training_finished was not received; forcing stop
              EngineDeadError

Root cause

The current check is the same for both trainer paths (pipelinerl/actor.py:158, 170-174):

samples_target = final_steps * cfg.finetune.train_batch_size * cfg.finetune.gradient_accumulation_passes

def is_trainer_finished() -> bool:
    return (
        trainer_state.samples_processed is not None
        and trainer_state.samples_processed >= samples_target
    )

This formula is correct for the HF/DeepSpeed path: gradient_accumulation_passes counts microbatches per optimizer step, and train_batch_size × gradient_accumulation_passes = samples per step.

Under Fast-LLM (use_fast_llm=True), gradient_accumulation_passes is vestigial — it is not propagated to the Fast-LLM trainer, which has its own schedule.docs_per_step configuration. Fast-LLM's _prefetch_to_doc_target (fast_llm/engine/training/trainer.py) also always overshoots docs_per_step by a few documents per step:

target = self._config.schedule.docs_per_step
total_docs = 0
while total_docs < target:
    mb = next(data_iterator)
    total_docs += mb[0].num_documents_in_batch  # variable per microbatch

The compound effect: with docs_per_step=1024, real per-step consumption is ~1043–1044 docs, so 400 trainer steps consume ~417k docs total. But the actor's samples_target is computed as 400 × 1 × 1026 = 410,400. The actor declares completion when the trainer is around step ~393/400, stops feeding redis, and the trainer eventually hits its 600s RedisStreamingDataset timeout.
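
To make the gap concrete, a back-of-the-envelope check (illustrative only: the ~1043.5 docs/step average is taken from the observed run, and the real actor reads samples_processed from trainer-state updates rather than simulating steps):

import math

# Actor-side target, as computed today with the HF/DeepSpeed-style formula:
samples_target = 400 * 1 * 1026        # final_steps * train_batch_size * gradient_accumulation_passes = 410,400

# What the Fast-LLM trainer actually consumes per step, including prefetch overshoot:
docs_per_step_actual = 1043.5          # observed average for docs_per_step=1024

total_docs_400_steps = 400 * docs_per_step_actual                          # ~417,400 docs
step_when_actor_stops = math.ceil(samples_target / docs_per_step_actual)   # ~394 in this rough model; the run above stopped near step 393

print(total_docs_400_steps, step_when_actor_stops)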

The colleague's multinode submit script (submit_eai_math_7b_multinode.sh) carries the same bug in latent form — gradient_accumulation_passes stays at 1024 (no divisibility adjustment, since finetune_fraction=4 across 4 nodes gives 16 trainer ranks, which divides 1024 evenly), so the actor stops at trainer step ~392 instead of 400. The truncation is small enough that it has gone unnoticed.

Fix

Fast-LLM already publishes an explicit {"type": "training_finished"} event to the fast_llm_events redis stream at the natural end of training (fast_llm/engine/training/streaming.py:train_end), and pipelinerl/state.py:112-115 already listens for it and sets TrainerState.training_done = True. The vLLM workers and TrainerState consume the same signal (pipelinerl/vllm1.py:534-547, pipelinerl/state.py:112-115).
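
For context, the shape of that signal path is roughly the sketch below (a minimal illustration using redis-py, not the actual pipelinerl/state.py code; the connection parameters and the name of the field holding the JSON payload are assumptions):

import json
import redis

r = redis.Redis()  # connection parameters are deployment-specific (assumption)
last_id = "0-0"
training_done = False

while not training_done:
    # Block up to 5 seconds waiting for new entries on the Fast-LLM event stream.
    for _stream, entries in r.xread({"fast_llm_events": last_id}, block=5000, count=16) or []:
        for entry_id, fields in entries:
            last_id = entry_id
            event = json.loads(fields[b"data"])  # "data" field name is an assumption
            if event.get("type") == "training_finished":
                training_done = True  # mirrors TrainerState.training_done = True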

This PR makes the actor's completion check route to that signal under Fast-LLM, falling back to the sample-counting heuristic for the HF/DeepSpeed path. Two call sites are updated symmetrically:

  • pipelinerl/actor.py:170-174 (is_trainer_finished function used by schedule_rollouts).
  • pipelinerl/actor.py:612-616 (inline mirror inside ActorLoop.run).

The HF/DeepSpeed path is unchanged. No new config keys, no race, no overshoot accounting, no implicit dependence on gradient_accumulation_passes.
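
Conceptually, the updated check looks like the sketch below (illustrative, not the literal diff; the exact attribute used to detect the Fast-LLM path, written here as cfg.finetune.use_fast_llm, is an assumption):

def is_trainer_finished() -> bool:
    if cfg.finetune.use_fast_llm:  # config path is an assumption; use_fast_llm is the real flag name
        # Fast-LLM path: trust the explicit training_finished event,
        # relayed into TrainerState by pipelinerl/state.py.
        return bool(trainer_state.training_done)
    # HF/DeepSpeed path: unchanged sample-counting heuristic.
    return (
        trainer_state.samples_processed is not None
        and trainer_state.samples_processed >= samples_target
    )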

Test plan

  • Reproduce the bug locally: math gspo run, max_train_steps=400, single 8-GPU node — observe is_trainer_finished() returning True at samples_processed=410,400 with the trainer at step 393.
  • After this PR: rerun same config, verify the trainer reaches step 400, emits training_finished, the actor sees trainer_state.training_done=True, and shuts down cleanly without EngineDeadError.
  • Spot-check that the HF/DeepSpeed path with use_fast_llm=false is unchanged: the samples_target calculation and check are unmodified for that branch.

🤖 Generated with Claude Code

fix(actor): use training_done event under Fast-LLM, not sample-counting heuristic

The actor decides when training is done via `is_trainer_finished()`. On the
legacy HF/DeepSpeed trainer path, the check is:

    samples_target = max_train_steps * train_batch_size * gradient_accumulation_passes
    samples_processed >= samples_target

This is correct for that path because `gradient_accumulation_passes` counts
the microbatches per optimizer step, so the product is samples per step.

Under Fast-LLM (`use_fast_llm=True`), the trainer uses its own
`schedule.docs_per_step` knob instead, and `gradient_accumulation_passes`
becomes vestigial — it is not propagated to the Fast-LLM trainer. Worse,
Fast-LLM's `_prefetch_to_doc_target` always overshoots `docs_per_step` by a
few documents per step (the loop runs `while total_docs < target` and stops
just after crossing it). The accumulated overshoot, combined with the
unrelated `gradient_accumulation_passes` value (which auto-adjusts to be
divisible by the trainer-rank count, e.g. 1024 -> 1026 for finetune_fraction=6),
makes `samples_target` fall thousands of documents short of the trainer's
true 400-step consumption.

Concrete reproduction on a single 8-GPU node with the math gspo recipe
(`max_train_steps=400`, `gradient_accumulation_passes=1026`, `docs_per_step=1024`,
`train_iters=400`): actor declares completion at samples_processed=410,400 while
the Fast-LLM trainer is only at step ~393/400, stops feeding redis, the
trainer's `RedisStreamingDataset` times out after 600s with
`No document received`, vLLM logs `[FastLLM] training_finished was not received;
forcing stop`, and the job dies with `EngineDeadError`.

Fast-LLM already publishes an explicit `{"type": "training_finished"}` event to
the `fast_llm_events` redis stream at the natural end of training
(`fast_llm/engine/training/streaming.py:train_end`), and PipelineRL already
listens for it and sets `TrainerState.training_done = True`
(`pipelinerl/state.py:112-115`). This commit makes `is_trainer_finished()` —
and its inline mirror in `ActorLoop.run()` at line ~614 — consult that flag
when `use_fast_llm=True`, while preserving the sample-counting heuristic for
the HF/DeepSpeed path. No race, no overshoot accounting, no implicit dependence
on `gradient_accumulation_passes`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
