fix: GDN F.conv1d fallback is missing the groups argument + out_norm compatibility with Ascend NPU #94

Open
Weizhena wants to merge 2 commits into modelscope:main from Weizhena:fix/npu-gdn-conv1d-outnorm

@Weizhena

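Background

ms-swift (Megatron-SWIFT) 0.16.x with mcore-bridge release/1.4 already supports TP/PP parallel training of the Qwen3.5/3.6 series GDN (Gated Delta Net) on NVIDIA GPUs via TransformerEngine + fla. On the Ascend 910B3 NPU, however, mcore-bridge has two problems, one in the F.conv1d fallback path and one in model initialization, that prevent training from running.

Note: both NVIDIA's upstream Megatron-Core gated_delta_net.py and the MindSpeed fork already include the groups= fix. This PR brings mcore-bridge in line with upstream.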

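Problem 1: the F.conv1d fallback is missing the groups argument (gated_delta_net.py)

When causal_conv1d is unavailable (for example, when the fla Triton kernel hits a UB overflow on 910B3), the code falls back to F.conv1d. The conv1d in GDN is a depthwise convolution, in which each channel has its own filter. The fallback code omits groups=qkv.shape[1], which causes:

RuntimeError: expected input to have 1 channels, but got 1280 channels

This is a correctness bug: even if the shapes happened to match, omitting groups would make PyTorch treat the call as a standard convolution rather than a depthwise one, and the result would be mathematically wrong.

Fix: add groups=qkv.shape[1], matching upstream Megatron-Core.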
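To make the role of groups concrete, here is a minimal pure-Python sketch of a grouped 1-D convolution. The function name grouped_conv1d, the nested-list layout, and the "pad by K-1, keep the first T outputs" causal convention are assumptions made for illustration; this is not the mcore-bridge code. It follows the same shape rule as F.conv1d: the input must have weight.shape[1] * groups channels, so depthwise weights of shape [C][1][K] only pass the check when groups equals C.

```python
def grouped_conv1d(x, weight, groups=1, padding=0):
    """Reference grouped 1-D cross-correlation on nested lists (illustrative only).

    x:      [C_in][T]                   a single sequence, no batch dimension
    weight: [C_out][C_in // groups][K]  same layout rule as a conv1d weight
    """
    c_in, t = len(x), len(x[0])
    c_out, c_per_group, k = len(weight), len(weight[0]), len(weight[0][0])

    # Shape rule: input channels must equal (weight in-channels per group) * groups.
    expected = c_per_group * groups
    if c_in != expected:
        raise ValueError(
            f"expected input to have {expected} channels, but got {c_in} channels"
        )
    if c_out % groups != 0:
        raise ValueError("output channels must be divisible by groups")

    out_per_group = c_out // groups
    # Zero-pad both ends of every channel.
    padded = [[0.0] * padding + list(row) + [0.0] * padding for row in x]
    t_out = t + 2 * padding - k + 1

    out = []
    for o in range(c_out):
        g = o // out_per_group  # group that output channel o belongs to
        row = []
        for pos in range(t_out):
            acc = 0.0
            for ci in range(c_per_group):
                src = padded[g * c_per_group + ci]
                for j in range(k):
                    acc += weight[o][ci][j] * src[pos + j]
            row.append(acc)
        out.append(row)
    return out


# Two channels, three time steps; depthwise weights have shape [C][1][K] with K = 2.
x = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
w = [[[1.0, 1.0]], [[0.0, 1.0]]]

# Assumed causal convention: pad by K - 1 and keep the first T outputs.
depthwise = [row[:3] for row in grouped_conv1d(x, w, groups=2, padding=1)]
print(depthwise)  # [[1.0, 3.0, 5.0], [10.0, 20.0, 30.0]]

# The same weights with the default groups=1 fail the shape rule.
try:
    grouped_conv1d(x, w, groups=1, padding=1)
    error = None
except ValueError as exc:
    error = str(exc)
print(error)  # expected input to have 1 channels, but got 2 channels
```

With groups equal to the channel count, each output channel reads only its own input channel, which is the depthwise behavior the PR description refers to. With 1280 channels instead of 2, the failing branch corresponds to the RuntimeError quoted above.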

Problem 2: the zero_centered_gamma assert fails with MindSpeed RMSNorm (qwen3_next_gdn.py)

During model initialization, Qwen3NextLoader directly asserts that out_norm has a zero_centered_gamma attribute. That attribute exists on TransformerEngine's FusedLayerNorm, but on Ascend NPU MindSpeed uses its own RMSNorm, which stores the setting in config.layernorm_zero_centered_gamma. The hard assert crashes immediately:

AssertionError: assert hasattr(layer.self_attention.out_norm, 'zero_centered_gamma')

Fix: replace the assert with a duck-typing check that works with both TransformerEngine (out_norm.zero_centered_gamma) and MindSpeed RMSNorm (out_norm.config.layernorm_zero_centered_gamma).

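Validation

Tested successfully on Ascend 910B3 (8x64GB HBM, two nodes, 16 cards in total):

  • Model: qwen3.6-27B full-parameter SFT
  • Configuration: TP=4, PP=2, max_length=16384, micro_batch_size=1
  • Framework: ms-swift (megatron sft) + MindSpeed core_r0.16.0 + mcore-bridge release/1.4
  • Result: with both fixes applied, training runs stably. Without the fix for Problem 1, the first GDN forward pass crashes; without the fix for Problem 2, model loading fails.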


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request modifies the model building process in qwen3_next_gdn.py to safely handle zero_centered_gamma configurations across different normalization layers and updates a convolution operation in gated_delta_net.py by specifying the groups parameter. Review feedback recommends adding a more specific attribute check for the configuration object to ensure robustness against missing attributes and notes that modifying shared configuration objects may have unintended global effects.

Comment on lines +144 to +145
elif hasattr(out_norm, 'config'):
    out_norm.config.layernorm_zero_centered_gamma = False

medium

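Before accessing out_norm.config.layernorm_zero_centered_gamma, consider checking that the attribute exists. MindSpeed's RMSNorm is expected to carry this setting on Ascend NPU, but a hasattr check is more robust and guards against an AttributeError in other environments or with other versions of the config object. Also, because the config object is usually shared across layers, modifying this attribute may have a global effect; please confirm that this is intended.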

Suggested change
- elif hasattr(out_norm, 'config'):
-     out_norm.config.layernorm_zero_centered_gamma = False
+ elif hasattr(out_norm, 'config') and hasattr(out_norm.config, 'layernorm_zero_centered_gamma'):
+     out_norm.config.layernorm_zero_centered_gamma = False

@Jintao-Huang
Collaborator

Jintao-Huang commented May 26, 2026

Hello! Please target the main branch.

I will cherry-pick it over manually.

@Weizhena Weizhena changed the base branch from release/1.4 to main May 26, 2026 07:47
@Weizhena
Author

> Hello! Please target the main branch.
>
> I will cherry-pick it over manually.

Hi, I have retargeted it.

if hasattr(out_norm, 'zero_centered_gamma'):
    out_norm.zero_centered_gamma = False
elif hasattr(out_norm, 'config'):
    out_norm.config.layernorm_zero_centered_gamma = False
Collaborator

The config object is shared, so this sets config.layernorm_zero_centered_gamma to False for every other layer as well.

As far as I remember, only out_norm needs layernorm_zero_centered_gamma set to False.
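The aliasing concern raised here can be reproduced with a few lines of plain Python. ToyConfig and ToyNorm are hypothetical stand-ins rather than MindSpeed or Megatron classes; the sketch only assumes that several layers hold a reference to one config object.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    # Hypothetical stand-in for the shared transformer config.
    layernorm_zero_centered_gamma: bool = True


class ToyNorm:
    # Hypothetical stand-in for a norm layer that keeps a reference to its config.
    def __init__(self, config):
        self.config = config


shared = ToyConfig()
input_norm = ToyNorm(shared)
out_norm = ToyNorm(shared)

# Intended as a change to out_norm only ...
out_norm.config.layernorm_zero_centered_gamma = False

# ... but both layers reference the same object, so the other layer sees it too.
leaked = input_norm.config.layernorm_zero_centered_gamma
print(leaked)  # False
```

Whether this leak changes numerical behavior depends on when each norm layer reads the flag, which this sketch does not model.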

Author

@Weizhena Weizhena May 26, 2026

Hi, thanks for the reply! We are currently trying to train qwen3.6-27b, and verification confirmed that the previous change was indeed wrong. I am now using:

import copy
if hasattr(layer.self_attention, 'out_norm'):
    out_norm = layer.self_attention.out_norm
    if hasattr(out_norm, 'zero_centered_gamma'):
        out_norm.zero_centered_gamma = False
    elif hasattr(out_norm, 'config'):
        out_norm.config = copy.copy(out_norm.config)
        out_norm.config.layernorm_zero_centered_gamma = False
return model

Would this change work correctly for now? With it, the first-step SFT loss went from 13 down to 1.4.
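The per-layer copy described in this comment can be sketched as a small helper. The name disable_zero_centered_gamma and the SimpleNamespace stand-ins are hypothetical; the sketch also folds in the extra hasattr guard suggested in the bot review. It assumes the norm layer reads the flag from self.config at call time, which the thread does not confirm for MindSpeed's RMSNorm.

```python
import copy
from types import SimpleNamespace


def disable_zero_centered_gamma(norm):
    """Turn off zero-centered gamma for one norm layer only (illustrative helper)."""
    if hasattr(norm, "zero_centered_gamma"):
        # TransformerEngine-style layer: the flag lives on the layer itself.
        norm.zero_centered_gamma = False
        return "attribute"
    config = getattr(norm, "config", None)
    if config is not None and hasattr(config, "layernorm_zero_centered_gamma"):
        # Config-driven layer: give it a private shallow copy before mutating,
        # so layers that share the original config are unaffected.
        norm.config = copy.copy(config)
        norm.config.layernorm_zero_centered_gamma = False
        return "config-copy"
    return "unchanged"


# Hypothetical stand-ins: two layers sharing one config, plus a TE-like layer.
shared = SimpleNamespace(layernorm_zero_centered_gamma=True, hidden_size=8)
input_norm = SimpleNamespace(config=shared)
out_norm = SimpleNamespace(config=shared)
te_like = SimpleNamespace(zero_centered_gamma=True)

print(disable_zero_centered_gamma(out_norm))  # config-copy
print(disable_zero_centered_gamma(te_like))   # attribute
print(input_norm.config.layernorm_zero_centered_gamma)  # True
```

copy.copy makes a shallow copy, so any nested objects inside the config remain shared; that is enough here because only a top-level boolean is changed.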

@Jintao-Huang
Collaborator

#97

@Jintao-Huang
Collaborator

Hello, I will open a separate PR first to fix the conv1d bug.
