
Log SFT metrics during training#719

Open
Kovbo wants to merge 1 commit into main from fix/sft-metrics-all-steps

Collaborator

@Kovbo Kovbo commented Jun 5, 2026

Summary

  • Log SFT optimizer metrics at configurable gradient-step intervals without creating per-step checkpoints. Detailed per-step metrics go to a separate sft/* W&B namespace that uses sft/gradient_step as the x-axis. SFT jobs still produce one checkpoint and one aggregate training metric.
  • Keep local backend logging client-owned, while letting remote SFT backends own server-side metric logging. This is intentionally different from RL: for Serverless SFT, the client may disconnect after starting training, so server-side logging is required if we want metrics to continue being recorded.
  • Forward SFT metric logging config through serverless training jobs.
  • Define W&B routing for the sft/* namespace.
  • Update dev SFT scripts to exercise Megatron/Qwen SFT.
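
For reviewers who want a concrete picture of the interval logging and the sft/* routing described above, here is a minimal sketch. It is not the code in this PR: `should_log`, `to_sft_payload`, `define_sft_routing`, and the `log_every` parameter are hypothetical names chosen for illustration, and the modulo-based interval rule is an assumption. Only the `sft/` prefix and the `sft/gradient_step` key come from the summary. `define_sft_routing` expects a W&B-run-like object exposing `define_metric(name, step_metric=...)`.

```python
from typing import Any, Mapping

SFT_PREFIX = "sft/"                  # namespace named in the summary
SFT_STEP_KEY = "sft/gradient_step"   # x-axis named in the summary


def should_log(gradient_step: int, log_every: int) -> bool:
    """Hypothetical interval rule: log when the step is a multiple of log_every."""
    return log_every > 0 and gradient_step % log_every == 0


def to_sft_payload(gradient_step: int, metrics: Mapping[str, float]) -> dict[str, float]:
    """Prefix each metric with the SFT namespace and attach the step key."""
    payload = {f"{SFT_PREFIX}{name}": value for name, value in metrics.items()}
    payload[SFT_STEP_KEY] = gradient_step
    return payload


def define_sft_routing(run: Any) -> None:
    """Plot every sft/* metric against sft/gradient_step on a run-like object."""
    run.define_metric(SFT_STEP_KEY)
    run.define_metric(f"{SFT_PREFIX}*", step_metric=SFT_STEP_KEY)
```

With a real W&B run, the payload from `to_sft_payload` would be passed to `run.log(...)` only on steps where `should_log` returns true, so no checkpoint is tied to each logged step.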

Testing

  • uv run ty check src tests
  • uv run pytest tests/unit/test_frontend_logging.py::TestTrainSFTMetricsAggregation tests/unit/test_metric_routing.py tests/unit/test_serverless_pipeline_trainer_compat.py -q
  • uv run python -m py_compile dev/sft/sft-from-file.py dev/sft/sft-warmup.py

@Kovbo Kovbo force-pushed the fix/sft-metrics-all-steps branch from 895895f to 2eb7d68 Compare June 5, 2026 02:06
@Kovbo Kovbo force-pushed the fix/sft-metrics-all-steps branch from 2eb7d68 to 5fb6b46 Compare June 5, 2026 02:23