Support qwen3.5 loss mask for multi-turn SFT#1742
Merged
zhuzilin merged 2 commits into THUDM:main on Mar 22, 2026
Conversation
Contributor
Author
@zhuzilin @Zhuohao-Li Please help review it. 💗
Background
`--loss-mask-type` defaults to `qwen`. When Qwen3.5 SFT is launched with this default, SFT rollout still goes through the legacy `qwen` loss-mask path, which is incompatible with Qwen3.5's multi-turn chat-template behavior and can fail with: `jinja2.exceptions.TemplateError: No user query found in messages`.
Using `--loss-mask-type qwen3` avoids the immediate crash, but it still does not match Qwen3.5 masking semantics on multi-turn conversations. For historical assistant turns, Qwen3.5 usually keeps only the final answer, while the current `qwen3` path may reconstruct extra reasoning content and supervise unnecessary thinking tokens. This inflates the token count and slows down SFT.
Changes
- Add `qwen3_5` as a valid `--loss-mask-type`
- Build the loss mask from the `offset_mapping` of the `apply_chat_template(..., tokenize=True)` output, so `token_ids` and `loss_mask` always have the same length
- Update the Qwen3.5 SFT entry script to use `--loss-mask-type qwen3_5`
Why this PR
- Without this change, the legacy `qwen` loss-mask path is used by default for Qwen3.5 SFT
Scope
This PR does not change the global default of `--loss-mask-type`. Instead, it introduces a Qwen3.5-specific option and updates the Qwen3.5 SFT entry script to use it explicitly, which keeps existing Qwen/Qwen3 behavior unchanged.
Testing
python -m pytest tests/utils/test_loss_mask_type_qwen35.py
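The `offset_mapping` idea described above can be sketched as follows. This is a minimal illustration only, not the PR's actual code: the function and variable names are hypothetical, and the "tokenization" is a hand-written list of character spans. The key invariant it demonstrates is the one the PR guarantees: the mask is built one entry per token, so `token_ids` and `loss_mask` can never diverge in length.

```python
# Hypothetical sketch: build a 0/1 loss mask from a tokenizer-style
# offset_mapping, supervising only the character spans that correspond
# to final assistant answers (no thinking tokens).

def build_loss_mask(offsets, answer_spans):
    """offsets: per-token (start, end) character spans, as returned by a
    fast tokenizer's offset_mapping; answer_spans: character ranges of
    the assistant answers to supervise. Returns a mask aligned 1:1 with
    the tokens."""
    mask = []
    for start, end in offsets:
        # Supervise a token only if it lies entirely inside an answer span
        # and is not a zero-width special token.
        supervised = any(s <= start and end <= e for s, e in answer_spans)
        mask.append(1 if supervised and end > start else 0)
    return mask

# Toy example: a fake templated conversation with one supervised answer.
text = "<user>hi</user><assistant>hello</assistant>"
# Pretend each token covers one of these character ranges of `text`.
offsets = [(0, 6), (6, 8), (8, 15), (15, 26), (26, 31), (31, 43)]
answer_spans = [(26, 31)]  # the characters of "hello"

mask = build_loss_mask(offsets, answer_spans)
# Only the token covering "hello" is supervised; lengths always match.
assert mask == [0, 0, 0, 0, 1, 0]
assert len(mask) == len(offsets)
```

Working from character offsets rather than re-tokenizing reconstructed text is what keeps the mask robust to chat-template changes: whatever the template emits, each token's span either falls inside an answer region or it does not.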