fix: GDN F.conv1d fallback is missing the groups argument + out_norm compatibility on Ascend NPU #94
Weizhena wants to merge 2 commits into
Code Review
This pull request modifies the model building process in qwen3_next_gdn.py to safely handle zero_centered_gamma configurations across different normalization layers and updates a convolution operation in gated_delta_net.py by specifying the groups parameter. Review feedback recommends adding a more specific attribute check for the configuration object to ensure robustness against missing attributes and notes that modifying shared configuration objects may have unintended global effects.
    elif hasattr(out_norm, 'config'):
        out_norm.config.layernorm_zero_centered_gamma = False
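Before accessing out_norm.config.layernorm_zero_centered_gamma, consider checking that the attribute exists. MindSpeed's RMSNorm on Ascend NPU is expected to carry this setting, but a hasattr check is safer and avoids an AttributeError in other environments or with config objects from other versions. Also, because the config object is usually shared across layers, changing this attribute may have a global effect; please confirm that this is intended.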
Suggested change:
    - elif hasattr(out_norm, 'config'):
    -     out_norm.config.layernorm_zero_centered_gamma = False
    + elif hasattr(out_norm, 'config') and hasattr(out_norm.config, 'layernorm_zero_centered_gamma'):
    +     out_norm.config.layernorm_zero_centered_gamma = False
Hello! Please target the main branch; I will cherry-pick it over manually.
Hi, it has been moved over.
    if hasattr(out_norm, 'zero_centered_gamma'):
        out_norm.zero_centered_gamma = False
    elif hasattr(out_norm, 'config'):
        out_norm.config.layernorm_zero_centered_gamma = False
The config object is shared, so this sets config.layernorm_zero_centered_gamma to False everywhere else as well.
As far as I remember, only out_norm needs layernorm_zero_centered_gamma to be False.
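Hi, thanks for the reply! We are now trying to train qwen3.6-27b, and verification confirmed that the original change was indeed wrong. We are now using: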
    import copy

    if hasattr(layer.self_attention, 'out_norm'):
        out_norm = layer.self_attention.out_norm
        if hasattr(out_norm, 'zero_centered_gamma'):
            out_norm.zero_centered_gamma = False
        elif hasattr(out_norm, 'config'):
            out_norm.config = copy.copy(out_norm.config)
            out_norm.config.layernorm_zero_centered_gamma = False
    return model
Would this change run correctly as a temporary fix? With it, the SFT loss at the first step dropped from 13 to 1.4.
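The shared-config concern raised in this thread can be illustrated with a small standalone sketch. FakeConfig and FakeNorm are hypothetical stand-ins, not MindSpeed or TransformerEngine classes, and the helper name disable_zero_centered_gamma is made up for illustration. The sketch only shows that a shallow copy keeps the flag change local to one layer; it assumes the real norm layer reads the flag from its own config attribute when it runs, which is not verified here.

```python
import copy


class FakeConfig:
    """Hypothetical stand-in for a shared transformer config object."""

    def __init__(self):
        self.layernorm_zero_centered_gamma = True


class FakeNorm:
    """Hypothetical stand-in for a norm layer that keeps a config reference."""

    def __init__(self, config):
        self.config = config


def disable_zero_centered_gamma(out_norm):
    """Turn the flag off for one norm layer without touching other layers."""
    if hasattr(out_norm, "zero_centered_gamma"):
        # Layout where the flag lives directly on the layer.
        out_norm.zero_centered_gamma = False
    elif hasattr(out_norm, "config") and hasattr(
        out_norm.config, "layernorm_zero_centered_gamma"
    ):
        # Layout where the flag lives on a config that may be shared:
        # give this layer a private shallow copy before changing it.
        out_norm.config = copy.copy(out_norm.config)
        out_norm.config.layernorm_zero_centered_gamma = False


shared_config = FakeConfig()
out_norm = FakeNorm(shared_config)
other_norm = FakeNorm(shared_config)

disable_zero_centered_gamma(out_norm)

print(out_norm.config.layernorm_zero_centered_gamma)    # False
print(other_norm.config.layernorm_zero_centered_gamma)  # True
```

Note that copy.copy is shallow: any nested mutable objects inside the config are still shared between the copy and the original, which is fine here because only a top-level boolean is changed.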
Hello, I will open a separate PR first to fix the conv1d bug.
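As background for the conv1d bug mentioned here, the following dependency-free sketch mimics the shape rules of torch.nn.functional.conv1d in plain Python to show what the groups argument changes. conv1d_reference is a hypothetical helper written for illustration, not the mcore-bridge code, and it leaves out padding, stride, and bias.

```python
def conv1d_reference(x, weight, groups=1):
    """Plain-Python 1-D cross-correlation (no padding, stride 1).

    x:      [C_in][T] nested lists
    weight: [C_out][C_in // groups][K] nested lists
    """
    c_in, c_out = len(x), len(weight)
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    in_per_group = c_in // groups
    out_per_group = c_out // groups
    if any(len(w) != in_per_group for w in weight):
        raise ValueError(
            f"weight has {len(weight[0])} input channels per group, "
            f"but the input provides {in_per_group}"
        )
    k = len(weight[0][0])
    t_out = len(x[0]) - k + 1
    out = []
    for o in range(c_out):
        first_in = (o // out_per_group) * in_per_group  # first visible input channel
        row = []
        for t in range(t_out):
            acc = 0.0
            for i in range(in_per_group):
                for j in range(k):
                    acc += weight[o][i][j] * x[first_in + i][t + j]
            row.append(acc)
        out.append(row)
    return out


# Two channels, three time steps, one 2-tap filter per channel (depthwise layout).
x = [[1.0, 2.0, 3.0],
     [10.0, 20.0, 30.0]]
depthwise_weight = [[[1.0, 1.0]],    # filter for channel 0
                    [[1.0, -1.0]]]   # filter for channel 1

# groups == number of channels: each channel is filtered independently.
depthwise_out = conv1d_reference(x, depthwise_weight, groups=len(x))
print(depthwise_out)  # [[3.0, 5.0], [-10.0, -10.0]]

# A dense weight of shape [C][C][K] is accepted with groups=1, but every
# output channel then sums over all input channels.
dense_weight = [[[1.0, 1.0], [1.0, 1.0]],
                [[1.0, -1.0], [1.0, -1.0]]]
dense_out = conv1d_reference(x, dense_weight, groups=1)
print(dense_out)  # [[33.0, 55.0], [-11.0, -11.0]]
```

Passing depthwise_weight with groups=1 raises ValueError in this sketch, because a [C][1][K] weight only lines up with a C-channel input when groups equals C. That mirrors the shape mismatch one would expect from a depthwise weight used without groups.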
Background
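ms-swift (Megatron-SWIFT) 0.16.x + mcore-bridge release/1.4 already supports TP/PP parallel training of the Qwen3.5/3.6 series GDN (Gated Delta Net) on NVIDIA GPUs via TransformerEngine + fla. On the Ascend 910B3 NPU, however, mcore-bridge has two problems, one in the F.conv1d fallback path and one in model initialization, that prevent training from running.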
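Note: both NVIDIA's upstream Megatron-Core gated_delta_net.py and the MindSpeed fork already include the groups= fix; this PR brings mcore-bridge in line with upstream.

Problem 1: F.conv1d fallback is missing the groups argument (gated_delta_net.py)

When causal_conv1d is unavailable (for example, when the fla Triton kernel hits a UB overflow on 910B3), the code falls back to F.conv1d. The conv1d in GDN, however, is a depthwise convolution, in which each channel has its own filter. The fallback code omits groups=qkv.shape[1], so the fallback path fails.

This is a correctness bug: even when the shapes happen to match, without groups PyTorch treats the operation as a standard convolution rather than a depthwise one, and the result is mathematically wrong.

Fix: add groups=qkv.shape[1], matching upstream Megatron-Core.

Problem 2: zero_centered_gamma assert fails under MindSpeed RMSNorm (qwen3_next_gdn.py)

During model initialization, Qwen3NextLoader directly asserts that out_norm has a zero_centered_gamma attribute. That attribute exists on TransformerEngine's FusedLayerNorm, but on Ascend NPU MindSpeed uses its own RMSNorm, which stores the setting in config.layernorm_zero_centered_gamma. The hard assert crashes immediately.

Fix: replace the assert with a duck-typing check that supports both TransformerEngine (out_norm.zero_centered_gamma) and MindSpeed RMSNorm (out_norm.config.layernorm_zero_centered_gamma).

Verification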
Tested successfully on Ascend 910B3 (8x64GB HBM, two nodes, 16 cards):