fix: GDN F.conv1d fallback missing groups parameter + out_norm Ascend NPU compatibility by Weizhena · Pull Request #94 · modelscope/mcore-bridge

Weizhena · 2026-05-25T15:36:44Z

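Background

ms-swift (Megatron-SWIFT) 0.16.x with mcore-bridge release/1.4 already supports TP/PP-parallel training of the Qwen3.5/3.6 series GDN (Gated Delta Net) on NVIDIA GPUs via TransformerEngine + fla. On the Ascend 910B3 NPU, however, mcore-bridge's F.conv1d fallback path and its model initialization each have a problem, and together they prevent training from running.

Note: both NVIDIA's upstream Megatron-Core gated_delta_net.py and the MindSpeed fork already include the groups= fix; this PR brings mcore-bridge in line with upstream.

Problem 1: F.conv1d fallback is missing the `groups` parameter (gated_delta_net.py)

When causal_conv1d is unavailable (for example, when the fla Triton kernel hits a UB overflow on 910B3), the code falls back to F.conv1d. The conv1d in GDN is a depthwise convolution, though: each channel has its own independent filter. The fallback code omits groups=qkv.shape[1], which leads to: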

RuntimeError: expected input to have 1 channels, but got 1280 channels

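This is a correctness bug: even if the shapes happened to match, omitting groups would make PyTorch treat the operation as a standard convolution rather than a depthwise one, producing mathematically wrong results.

Fix: add groups=qkv.shape[1], consistent with upstream Megatron-Core.

Problem 2: `zero_centered_gamma` assert fails under MindSpeed RMSNorm (qwen3_next_gdn.py)

During model initialization, Qwen3NextLoader directly asserts that out_norm has a zero_centered_gamma attribute. That attribute exists on TransformerEngine's FusedLayerNorm, but on Ascend NPU MindSpeed uses its own RMSNorm, where the setting is stored in config.layernorm_zero_centered_gamma. The hard assert crashes outright: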
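The channel rule behind that error can be sketched without any framework. The block below is a minimal pure-Python reference, not the mcore-bridge code: `conv1d_ref` is a hypothetical helper that mirrors the documented shape rule of `torch.nn.functional.conv1d` (weight shaped `(out_channels, in_channels / groups, kernel_size)`) for a single sample, with no bias, no padding, and stride 1. The toy sizes (3 channels, kernel length 2) are assumptions chosen for readability.

```python
def conv1d_ref(x, weight, groups=1):
    """Reference 1-D cross-correlation for one sample (hypothetical helper).

    x:      list of C_in channels, each a list of L floats
    weight: list of C_out filters, each a list of (C_in // groups) kernels,
            each kernel a list of K floats
    """
    c_in, c_out = len(x), len(weight)
    in_per_group = len(weight[0])
    # Same consistency rule a grouped convolution enforces on channel counts.
    if c_in != in_per_group * groups:
        raise ValueError(
            f"expected input to have {in_per_group * groups} channels, "
            f"but got {c_in} channels")
    if c_out % groups != 0:
        raise ValueError("out_channels must be divisible by groups")
    out_per_group = c_out // groups
    k = len(weight[0][0])
    length = len(x[0]) - k + 1
    out = []
    for o in range(c_out):
        g = o // out_per_group  # group that output channel o belongs to
        row = []
        for t in range(length):
            acc = 0.0
            for i in range(in_per_group):
                channel = x[g * in_per_group + i]
                for j in range(k):
                    acc += weight[o][i][j] * channel[t + j]
            row.append(acc)
        out.append(row)
    return out


# Depthwise layout: one kernel per channel, so the weight is shaped (3, 1, 2).
x = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [100.0, 200.0, 300.0]]
w = [[[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]]]

# groups == number of channels: each channel is filtered by its own kernel.
depthwise = conv1d_ref(x, w, groups=len(x))

# groups left at its default of 1: the weight now claims a 1-channel input.
try:
    conv1d_ref(x, w)
    default_groups_error = None
except ValueError as exc:
    default_groups_error = str(exc)
```

With `groups=len(x)` every output row depends only on its own input channel, which is the depthwise behaviour described above. With the default `groups=1` the same weight is read as "one input channel", so a multi-channel input is rejected, which matches the shape of the reported RuntimeError.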

AssertionError: assert hasattr(layer.self_attention.out_norm, 'zero_centered_gamma')

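Fix: replace the assert with a duck-typing check that supports both TransformerEngine (out_norm.zero_centered_gamma) and MindSpeed RMSNorm (out_norm.config.layernorm_zero_centered_gamma).

Verification

Tested successfully on Ascend 910B3 (8x64GB HBM, two nodes, 16 devices):

Model: qwen3.6-27B, full-parameter SFT
Configuration: TP=4, PP=2, max_length=16384, micro_batch_size=1
Framework: ms-swift (megatron sft) + MindSpeed core_r0.16.0 + mcore-bridge release/1.4
Result: with both fixes applied, training runs stably. Without the fix for Problem 1, the first GDN forward pass crashes; without the fix for Problem 2, model loading fails.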

gemini-code-assist

Code Review

This pull request modifies the model building process in qwen3_next_gdn.py to safely handle zero_centered_gamma configurations across different normalization layers and updates a convolution operation in gated_delta_net.py by specifying the groups parameter. Review feedback recommends adding a more specific attribute check for the configuration object to ensure robustness against missing attributes and notes that modifying shared configuration objects may have unintended global effects.

gemini-code-assist · 2026-05-25T15:38:05Z

+                elif hasattr(out_norm, 'config'):
+                    out_norm.config.layernorm_zero_centered_gamma = False


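Before accessing out_norm.config.layernorm_zero_centered_gamma, consider checking that the attribute exists. MindSpeed's RMSNorm is expected to carry this setting on Ascend NPU, but a hasattr check is safer and avoids an AttributeError in other environments or with other versions of the config object. Also, because the config object is usually shared across multiple layers, modifying this attribute may have a global effect; please confirm that this is intended.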

Suggested change

-                elif hasattr(out_norm, 'config'):
-                    out_norm.config.layernorm_zero_centered_gamma = False
+                elif hasattr(out_norm, 'config') and hasattr(out_norm.config, 'layernorm_zero_centered_gamma'):
+                    out_norm.config.layernorm_zero_centered_gamma = False

Jintao-Huang · 2026-05-26T02:50:39Z

Hello! Please target the main branch with this PR.

I'll cherry-pick it over manually.

Weizhena · 2026-05-26T07:47:58Z

> Hello! Please target the main branch with this PR.
>
> I'll cherry-pick it over manually.

Hi, I've moved it over.

Jintao-Huang · 2026-05-26T08:13:17Z

+                if hasattr(out_norm, 'zero_centered_gamma'):
+                    out_norm.zero_centered_gamma = False
+                elif hasattr(out_norm, 'config'):
+                    out_norm.config.layernorm_zero_centered_gamma = False


The config object is shared, so this sets config.layernorm_zero_centered_gamma to False everywhere else as well.

But as far as I remember, only out_norm needs layernorm_zero_centered_gamma set to False.

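Weizhena · May 26, 2026 (edited)

Hi, thanks for the reply! I'm trying to train qwen3.6-27b, and on checking, the original change was indeed wrong. I'm now using:

    import copy
    if hasattr(layer.self_attention, 'out_norm'):
        out_norm = layer.self_attention.out_norm
        if hasattr(out_norm, 'zero_centered_gamma'):
            out_norm.zero_centered_gamma = False
        elif hasattr(out_norm, 'config'):
            # Give out_norm a private shallow copy so the shared config is untouched.
            out_norm.config = copy.copy(out_norm.config)
            out_norm.config.layernorm_zero_centered_gamma = False
    return model

Would this change work correctly as a stopgap? With it, the first-step SFT loss went from 13 down to 1.4.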
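The sharing concern raised in this thread, and the shallow-copy workaround, can be illustrated with toy classes. Everything named `Toy*` and the helper `disable_zero_centered_gamma` are hypothetical stand-ins, not MindSpeed or TransformerEngine classes. The sketch also assumes that the norm layer reads the flag from its `config` at use time; whether a real layer does so, or caches the value at construction, is not confirmed by this thread.

```python
import copy
from dataclasses import dataclass


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a config object shared by many layers."""
    layernorm_zero_centered_gamma: bool = True
    hidden_size: int = 8


class ToyConfigNorm:
    """Hypothetical norm that keeps the flag on its config."""
    def __init__(self, config):
        self.config = config


class ToyAttrNorm:
    """Hypothetical norm that keeps the flag as a direct attribute."""
    def __init__(self):
        self.zero_centered_gamma = True


def disable_zero_centered_gamma(norm):
    """Duck-typed switch-off that never mutates a shared config."""
    if hasattr(norm, 'zero_centered_gamma'):
        norm.zero_centered_gamma = False
    elif hasattr(norm, 'config') and hasattr(norm.config, 'layernorm_zero_centered_gamma'):
        # Shallow copy: this norm gets its own config object, while the
        # original stays as it was for every other layer that holds it.
        norm.config = copy.copy(norm.config)
        norm.config.layernorm_zero_centered_gamma = False


shared = ToyConfig()
input_norm = ToyConfigNorm(shared)   # some other layer using the same config
out_norm = ToyConfigNorm(shared)     # the layer that needs the flag off
attr_norm = ToyAttrNorm()

disable_zero_centered_gamma(out_norm)
disable_zero_centered_gamma(attr_norm)
```

After the calls, only `out_norm` and `attr_norm` see the flag as False; `input_norm` still points at the untouched shared object. A shallow copy is enough here because only a top-level boolean is reassigned; nested mutable fields would still be shared between the copy and the original.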

Jintao-Huang · 2026-05-26T09:47:42Z

#97

Jintao-Huang · 2026-05-26T09:48:12Z

Hello, I'm opening a separate PR first to fix the conv1d bug.

Jintao-Huang and others added 2 commits May 17, 2026 17:40

bump version

6a39584

fix: GDN F.conv1d fallback missing groups + out_norm NPU compatibility

4b2f4cd

gemini-code-assist Bot reviewed May 25, 2026

Weizhena changed the base branch from release/1.4 to main May 26, 2026 07:47

Jintao-Huang reviewed May 26, 2026

Weizhena wants to merge 2 commits into modelscope:main from Weizhena:fix/npu-gdn-conv1d-outnorm
