
Fix NVFP4 quantization for Qwen3.x MoE models (4 silent-failure bugs)#1323

Open
erictinkeredapps wants to merge 17 commits into NVIDIA:main from Tinkered-Apps:fix-qwen3x-moe-nvfp4-export

Conversation

@erictinkeredapps

@erictinkeredapps erictinkeredapps commented Apr 22, 2026

Summary

Four bugs prevent NVFP4 quantization from producing quantized weights for Qwen3.5/3.6 MoE models (and likely other fused MoE architectures using _QuantFusedExperts). All four produce silent failures — no errors, just bfloat16 output identical to the input model.

Test Environment

  • Model: Qwen3.6-35B-A3B (MoE, 256 experts, top-8 routing)
  • Hardware: NVIDIA DGX Spark (GB10, Blackwell)
  • ModelOpt: 0.45.0 dev (editable install)
  • Transformers: 5.5.4
  • Result: 20.5 GB NVFP4 output (down from 66 GB bfloat16), verified uint8 expert weights with float8_e4m3fn local scales + float32 global scales

Bug Details

Bug 1: is_multimodal_model() crashes on None architectures

File: modelopt/torch/export/model_utils.py
Models with config.architectures = None (common for fine-tuned checkpoints) crash when is_multimodal_model() iterates the list. One-line fix: or [] fallback.
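A minimal sketch of the fix, assuming a simplified shape for the helper (the marker strings and function body here are illustrative stand-ins, not the real model_utils.py code):

```python
import types

def is_multimodal_model(config) -> bool:
    # 'or []' guards both a missing attribute and an explicit None value,
    # so the iteration below never sees NoneType.
    architectures = getattr(config, "architectures", []) or []
    return any("VL" in arch or "Vision" in arch for arch in architectures)

# A fine-tuned checkpoint with architectures=None no longer crashes:
cfg = types.SimpleNamespace(architectures=None)
assert is_multimodal_model(cfg) is False
```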

Bug 2: (Usage issue, not a code bug — fixed in caller)

Bug 3: get_quantization_format() does not recognize _QuantFusedExperts

File: modelopt/torch/export/quant_utils.py
The function iterates over weight_attr_names(module), which returns singular attribute names. _QuantFusedExperts modules instead use plural ModuleList quantizers (gate_up_proj_weight_quantizers.N), so the function returns None and the module is treated as unquantized. Fixed by adding a pre-check for plural ModuleList quantizers before the singular loop.
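A hedged sketch of the pre-check. The quantizer objects below are minimal stand-ins (the real ones are modelopt TensorQuantizers); attribute names and the NVFP4 bit pattern follow the description above:

```python
import types

class FakeQuantizer:
    """Stand-in for a modelopt TensorQuantizer (illustrative only)."""
    def __init__(self, enabled, num_bits=(2, 1), scale_bits=(4, 3)):
        self.is_enabled = enabled
        self.num_bits = num_bits
        self.block_sizes = {"scale_bits": scale_bits}

PLURAL_ATTRS = ("gate_up_proj_weight_quantizers", "down_proj_weight_quantizers")

def detect_fused_format(module):
    for attr in PLURAL_ATTRS:
        quantizers = getattr(module, attr, None)
        if not quantizers:
            continue
        # Prefer the first *enabled* quantizer; expert 0 may be uncalibrated
        # (this mirrors the follow-up review fix), falling back to the first.
        q = next((x for x in quantizers if x.is_enabled), quantizers[0])
        if q.num_bits == (2, 1) and q.block_sizes.get("scale_bits") == (4, 3):
            return "NVFP4"
    return None

experts = types.SimpleNamespace(
    gate_up_proj_weight_quantizers=[FakeQuantizer(enabled=False), FakeQuantizer(enabled=True)]
)
assert detect_fused_format(experts) == "NVFP4"          # picks the enabled quantizer
assert detect_fused_format(types.SimpleNamespace()) is None  # plain module: no plural attrs
```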

Bug 4: NVFP4 config wildcards do not match plural quantizer names

File: modelopt/torch/quantization/config.py
_nvfp4_selective_quant_cfg() generates patterns like *mlp.experts*weight_quantizer (singular). _QuantFusedExperts creates quantizers named gate_up_proj_weight_quantizers.0 (plural + index). The fnmatch fails, quantizers never receive NVFP4 config, and 100% stay at disabled default. Added wildcard entries for both plural suffix patterns.
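The mismatch is easy to reproduce with fnmatch directly (the fully qualified module name below is illustrative, not taken from a real checkpoint):

```python
from fnmatch import fnmatch

name = "model.layers.0.mlp.experts.gate_up_proj_weight_quantizers.0"

# The singular pattern never matches the plural, indexed name...
assert not fnmatch(name, "*mlp.experts*weight_quantizer")
# ...while a plural wildcard entry does.
assert fnmatch(name, "*mlp.experts*gate_up_proj_weight_quantizers*")
```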

Bug 5: _process_quantized_modules elif order sends fused MoE to wrong export path

File: modelopt/torch/export/unified_export_hf.py
Two elif branches: one checks type name ("Llama4TextExperts" in type().__name__), the other checks hasattr("gate_up_proj_weight_quantizers"). After _QuantFusedExperts wrapping, QuantQwen3_5MoeExperts matches the type-name branch, which calls _export_quantized_weight() looking for singular attributes → AttributeError. Swapped the elif order so the plural-attribute check runs first.
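A toy sketch of the corrected dispatch order. The class names follow this description; the routing function and return strings are stand-ins for the real export branches:

```python
class QuantQwen3_5MoeExperts:
    gate_up_proj_weight_quantizers = ["..."]  # plural ModuleList-style attrs

class Llama4TextExperts:
    pass

def route_export(module) -> str:
    # The hasattr check must run first: after _QuantFusedExperts wrapping,
    # the type-name branch would otherwise claim the module and then look
    # for singular attributes that do not exist (AttributeError).
    if hasattr(module, "gate_up_proj_weight_quantizers"):
        return "fused_experts"      # _export_fused_experts(..., reshard=False)
    elif "Llama4TextExperts" in type(module).__name__:
        return "llama4_experts"     # singular-attribute export path
    return "default"

assert route_export(QuantQwen3_5MoeExperts()) == "fused_experts"
assert route_export(Llama4TextExperts()) == "llama4_experts"
```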

Changes

File                                          Change
modelopt/torch/export/model_utils.py          +1 line: or [] fallback
modelopt/torch/export/quant_utils.py          +19 lines: plural ModuleList check
modelopt/torch/export/unified_export_hf.py    elif reorder (17 lines changed)
modelopt/torch/quantization/config.py         +6 lines: plural wildcard patterns

Summary by CodeRabbit

  • Bug Fixes

    • More robust model architecture detection when config fields are missing or falsy.
    • Avoids failures when optional attention components are absent.
    • Handles empty tensors during quantization to prevent downstream errors.
  • New Features

    • Better support for fused MoE (expert) modules in quantization detection and export.
    • Quantization config now recognizes plural-style expert weight quantizers.
    • Exporter extended for additional Megatron variants and a new model mapping.

Four bugs prevent NVFP4 export from producing quantized weights for
Qwen3.5/3.6 MoE models (and potentially other fused MoE architectures).
All produce silent failures — no errors, just bfloat16 output identical
to input.

Bug 1: is_multimodal_model() crashes when config.architectures is None
  - model_utils.py: add 'or []' fallback for NoneType iteration

Bug 3: get_quantization_format() doesn't recognize _QuantFusedExperts
  - quant_utils.py: add check for plural ModuleList quantizers
    (gate_up_proj_weight_quantizers, down_proj_weight_quantizers)
    before the singular weight_quantizer loop

Bug 4: NVFP4 config wildcards don't match plural quantizer names
  - config.py: _nvfp4_selective_quant_cfg() only generates patterns
    for singular 'weight_quantizer', but _QuantFusedExperts creates
    plural ModuleList quantizers. Add wildcard entries for both
    gate_up_proj_weight_quantizers* and down_proj_weight_quantizers*

Bug 5: _process_quantized_modules elif order sends fused MoE to wrong path
  - unified_export_hf.py: swap elif branches so hasattr check for
    gate_up_proj_weight_quantizers comes before type-name checks.
    Without this, QuantQwen3_5MoeExperts hits the singular-attribute
    branch and crashes with AttributeError

Tested on: Qwen3.6-35B-A3B (MoE), NVIDIA DGX Spark (GB10),
modelopt 0.45.0 dev, transformers 5.5.4
Output: 20.5 GB NVFP4 (down from 66 GB bfloat16)
@copy-pr-bot

copy-pr-bot Bot commented Apr 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Updates export/quantization to detect and handle fused Mixture-of-Experts (MoE) modules: normalizes multimodal detection, adds plural-quantizer NVFP4 detection, reorders fused-expert export, extends NVFP4 config patterns, and registers a Megatron↔HF mapping.

Changes

  • Multimodal Model Detection (modelopt/torch/export/model_utils.py)
    Normalize config.architectures in is_multimodal_model() to getattr(config, "architectures", []) or [] so downstream iteration is safe when the attribute is missing or falsy.
  • Fused Expert Quantization Detection (modelopt/torch/export/quant_utils.py)
    Add a pre-check in get_quantization_format() that inspects plural ModuleList quantizer attrs (gate_up_proj_weight_quantizers, down_proj_weight_quantizers), picks a representative enabled quantizer (falling back to the first), extracts num_bits and block_sizes.scale_bits, and detects the NVFP4 pattern.
  • NVFP4 Selective Config Generation (modelopt/torch/quantization/config.py)
    Extend _nvfp4_selective_quant_cfg() to also append wildcard-matched entries for plural quantizer attribute names ({pattern}gate_up_proj_weight_quantizers*, {pattern}down_proj_weight_quantizers*) alongside {pattern}weight_quantizer.
  • Export Control Flow Reordering (modelopt/torch/export/unified_export_hf.py)
    Reorder _process_quantized_modules() so modules with gate_up_proj_weight_quantizers are handled earlier and exported via _export_fused_experts(..., reshard=False) before the specialized expert-type branches.
  • Megatron ↔ HF Mapping Update (modelopt/torch/export/plugins/mcore_common.py)
    Register "Qwen3_5MoeForConditionalGeneration" in both export/import mapping dicts, routing to the existing qwen3_causal_lm_export / qwen3_causal_lm_import handlers.
  • Megatron Exporter Robustness & Types (modelopt/torch/export/unified_export_megatron.py)
    Accept additional Megatron model classes (HybridModel, Qwen3VLModel), add guarded handling for GatedDeltaNet attention export, and add attribute-existence guards (e.g., q_layernorm, core_attention) to avoid attribute errors.
  • Qwen Plugin Mappings (modelopt/torch/export/plugins/mcore_qwen.py)
    Extend the qwen3_causal_lm_export mapping to include MoE shared_experts weights (shared_experts.linear_fc1, shared_experts.linear_fc2) and GatedDeltaNet attention remaps (gated_delta_net_in_proj, gated_delta_net_out_norm).
  • NVFP4 QTensor Empty-shard Handling (modelopt/torch/quantization/qtensor/nvfp4_tensor.py)
    Make NVFP4QTensor.quantize return early for empty input tensors, producing properly shaped zero scales and an empty quantized tensor to avoid downstream shape assumptions.

Sequence Diagram(s)

sequenceDiagram
    participant Exporter as Exporter
    participant Scanner as Module Scanner
    participant QuantDetector as Quant format detector
    participant FusedExporter as Fused Experts Exporter
    participant NVFP4Cfg as NVFP4 Config Generator

    Exporter->>Scanner: enumerate sub-modules
    Scanner->>QuantDetector: inspect module attributes
    alt module has plural quantizers
        QuantDetector->>QuantDetector: pick enabled/first quantizer, read num_bits & scale_bits
        QuantDetector-->>Exporter: return QUANTIZATION_NVFP4
        Exporter->>FusedExporter: call _export_fused_experts(..., reshard=False)
    else other quantized module
        QuantDetector-->>Exporter: return other quant format
        Exporter->>Exporter: follow existing export branches
    end
    Exporter->>NVFP4Cfg: request selective NVFP4 patterns
    NVFP4Cfg-->>Exporter: include plural-quantizer wildcard patterns

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 60.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (5 passed)
  • Description Check (✅ Passed): check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title accurately summarizes the main change: fixing NVFP4 quantization issues (described as 4 silent-failure bugs) for Qwen3.x MoE models, which is the primary focus of all file changes.
  • Linked Issues Check (✅ Passed): check skipped; no linked issues were found for this pull request.
  • Out of Scope Changes Check (✅ Passed): check skipped; no linked issues were found for this pull request.
  • Security Anti-Patterns (✅ Passed): a comprehensive audit of all eight modified files found no unsafe deserialization, hardcoded trust_remote_code, eval/exec on untrusted input, nosec comments, or new non-permissive dependencies.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/export/quant_utils.py`:
- Around line 673-685: The code currently only inspects quantizer_list[0] to
detect NVFP4, which misses cases where expert 0 is disabled; update the logic in
the detection block to iterate over quantizer_list and find the first quantizer
q where hasattr(q, "is_enabled") and q.is_enabled (or otherwise any enabled
quantizer), then read num_bits, block_sizes and compute scale_bits from that
enabled q and return QUANTIZATION_NVFP4 when matching (num_bits == (2, 1) and
scale_bits == (4, 3)); ensure the fallback to QUANTIZATION_NONE only happens
after checking all quantizers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: f0da2041-19c7-4d5f-b7a8-ecfb8c942950

📥 Commits

Reviewing files that changed from the base of the PR and between e56682e and 1b1fced.

📒 Files selected for processing (4)
  • modelopt/torch/export/model_utils.py
  • modelopt/torch/export/quant_utils.py
  • modelopt/torch/export/unified_export_hf.py
  • modelopt/torch/quantization/config.py

Comment thread: modelopt/torch/export/quant_utils.py (Outdated)
CodeRabbit review: expert 0 may be disabled when uncalibrated, so
checking only quantizer_list[0] can miss the actual NVFP4 config.
Now iterates to find the first enabled quantizer in the list.
…nce check

- Add Qwen3_5MoeForConditionalGeneration to export/import mappings
- Add Qwen3VLModel + HybridModel to GPTModelExporter isinstance check
- Handle GatedDeltaNet layers in _get_transformer_layer_state_dict
- Fix quantizer format detection for disabled quantizers
Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/export/unified_export_megatron.py (1)

465-494: ⚠️ Potential issue | 🟠 Major

Guard the new softmax_offset export path.

This branch indexes self.rules["softmax_offset"] unconditionally once the attribute exists. If the active rule book does not define that key for a supported attention variant, export will raise KeyError even though the rest of the path is valid.

🔧 Suggested fix
-                if hasattr(layer.self_attention, "core_attention") and getattr(layer.self_attention.core_attention, "softmax_offset", None) is not None:
-                    self.rules["softmax_offset"](
+                if (
+                    "softmax_offset" in self.rules
+                    and hasattr(layer.self_attention, "core_attention")
+                    and getattr(layer.self_attention.core_attention, "softmax_offset", None) is not None
+                ):
+                    self.rules["softmax_offset"](
                         layer.self_attention.core_attention.softmax_offset, layer_id
                     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/unified_export_megatron.py` around lines 465 - 494, The
code unconditionally accesses self.rules["softmax_offset"] when
layer.self_attention.core_attention has softmax_offset, which can raise KeyError
if the active rule book lacks that key; update the final branch that currently
checks getattr(layer.self_attention.core_attention, "softmax_offset", None) to
also verify "softmax_offset" in self.rules (and optionally that it's callable)
before invoking
self.rules["softmax_offset"](layer.self_attention.core_attention.softmax_offset,
layer_id), keeping the existing hasattr/getattr checks for
core_attention/softmax_offset and referencing the symbols self.rules,
"softmax_offset", layer.self_attention.core_attention, and layer_id.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 465-494: The code unconditionally accesses
self.rules["softmax_offset"] when layer.self_attention.core_attention has
softmax_offset, which can raise KeyError if the active rule book lacks that key;
update the final branch that currently checks
getattr(layer.self_attention.core_attention, "softmax_offset", None) to also
verify "softmax_offset" in self.rules (and optionally that it's callable) before
invoking
self.rules["softmax_offset"](layer.self_attention.core_attention.softmax_offset,
layer_id), keeping the existing hasattr/getattr checks for
core_attention/softmax_offset and referencing the symbols self.rules,
"softmax_offset", layer.self_attention.core_attention, and layer_id.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 87ec7afa-9907-4194-886c-93fccd2f666d

📥 Commits

Reviewing files that changed from the base of the PR and between 5d5c492 and c1e09d8.

📒 Files selected for processing (3)
  • modelopt/torch/export/plugins/mcore_common.py
  • modelopt/torch/export/quant_utils.py
  • modelopt/torch/export/unified_export_megatron.py
✅ Files skipped from review due to trivial changes (1)
  • modelopt/torch/export/plugins/mcore_common.py

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 465-476: The GatedDeltaNet branch can silently skip applying
required rules and produce incomplete checkpoints; update the handling in
unified_export_megatron.py so that when "GatedDeltaNet" is detected on
layer.self_attention you assert/raise an explicit error if the required rule
keys are missing (e.g., "gated_delta_net_in_proj" and "linear_proj"), and also
raise if out_norm exists and is not IdentityOp but "gated_delta_net_out_norm" is
absent; keep the existing calls to
self.rules["gated_delta_net_in_proj"](layer.self_attention.in_proj, layer_id),
self.rules["gated_delta_net_out_norm"](layer.self_attention.out_norm, layer_id),
and self.rules["linear_proj"](layer.self_attention.out_proj, layer_id) but guard
them with clear exceptions mentioning the missing rule names.
- Line 126: The multimodal detection currently only checks for LLaVAModel,
causing Qwen3VLModel to be treated like a unimodal model; update the two
is_multimodal checks that reference LLaVAModel so they also include Qwen3VLModel
(so Qwen3VLModel is detected as multimodal), ensuring the export path uses
model.language_model where intended and that multimodal components are loaded
(the same checks that gate using model.language_model and loading multimodal
parts should include Qwen3VLModel).

In `@modelopt/torch/quantization/qtensor/nvfp4_tensor.py`:
- Around line 289-295: The empty-tensor branch must preserve the same return
contract as the non-empty path and avoid running reductions on empty inputs:
move the early-return check on input.numel() to before any calls to
reduce_block_amax() or other reductions; when empty, compute num_blocks =
ceil_div(input.shape[-1], BLOCK_SIZE) (same logic used elsewhere) and return the
same tuple structure as the non-empty paths—i.e., if keep_high_precision is True
return the high-precision payload only (matching the branch that returns at line
~310), otherwise return (cls(input_shape, input_dtype, input), a properly-shaped
packed uint8 payload filled with zeros matching the packed layout dequantize()
expects, and a scale tensor shaped (*input.shape[:-1], num_blocks) filled with
zeros); ensure the payload dtype/format matches dequantize()'s bitwise unpacking
(packed uint8 with low/high nibbles), and use the same device/dtype conventions
as reduce_block_amax() and dequantize().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a211fed2-1afc-4bf1-b1a1-bcd8e4e8a759

📥 Commits

Reviewing files that changed from the base of the PR and between 0ddd356 and 344badb.

📒 Files selected for processing (3)
  • modelopt/torch/export/plugins/mcore_qwen.py
  • modelopt/torch/export/unified_export_megatron.py
  • modelopt/torch/quantization/qtensor/nvfp4_tensor.py

 ):
     """Create a GPTModel exporter instance."""
-    if not isinstance(model, (GPTModel, MambaModel, LLaVAModel)):
+    if not isinstance(model, (GPTModel, MambaModel, HybridModel, LLaVAModel, Qwen3VLModel)):
Contributor


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify accepted model classes vs multimodal routing logic in this exporter.
rg -n -C2 'Qwen3VLModel|isinstance\(model, \(GPTModel, MambaModel, HybridModel, LLaVAModel, Qwen3VLModel\)\)|is_multimodal = isinstance\(model, LLaVAModel\)|self\.model = model\.language_model if self\.is_multimodal else model|load_multimodal_components' modelopt/torch/export/unified_export_megatron.py

Repository: NVIDIA/Model-Optimizer

Length of output: 1923


Qwen3VLModel is accepted but not routed through multimodal export handling.

Line 126 accepts Qwen3VLModel in the isinstance check, but multimodal detection at lines 151 and 157 only checks for LLaVAModel. This causes Qwen3VLModel to skip multimodal export logic: line 158 will use the full model instead of model.language_model, and line 349 will not load multimodal components. The export will be incomplete/incorrect.

Update both is_multimodal checks to include Qwen3VLModel:

Suggested fix
-        self.is_multimodal = isinstance(model, LLaVAModel)
+        self.is_multimodal = isinstance(model, (LLaVAModel, Qwen3VLModel))

Apply this change at both line 151 and line 157.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/unified_export_megatron.py` at line 126, The multimodal
detection currently only checks for LLaVAModel, causing Qwen3VLModel to be
treated like a unimodal model; update the two is_multimodal checks that
reference LLaVAModel so they also include Qwen3VLModel (so Qwen3VLModel is
detected as multimodal), ensuring the export path uses model.language_model
where intended and that multimodal components are loaded (the same checks that
gate using model.language_model and loading multimodal parts should include
Qwen3VLModel).

Comment on lines +465 to +476
            elif "GatedDeltaNet" in str(type(layer.self_attention)):
                # GatedDeltaNet (linear attention) has in_proj, out_norm, out_proj
                # instead of linear_qkv, q_layernorm, etc.
                # Use dedicated GDN rules if available (no QKV slicing), else skip.
                if "gated_delta_net_in_proj" in self.rules:
                    self.rules["gated_delta_net_in_proj"](layer.self_attention.in_proj, layer_id)
                if hasattr(layer.self_attention, "out_norm") and not isinstance(
                    layer.self_attention.out_norm, IdentityOp
                ):
                    if "gated_delta_net_out_norm" in self.rules:
                        self.rules["gated_delta_net_out_norm"](layer.self_attention.out_norm, layer_id)
                self.rules["linear_proj"](layer.self_attention.out_proj, layer_id)
Contributor


⚠️ Potential issue | 🟠 Major

Fail fast when required GatedDeltaNet rules are missing.

This branch currently skips in_proj / out_norm when rule keys are absent, which can silently emit incomplete checkpoints. Please raise an explicit error instead of skipping.

Suggested fix
             elif "GatedDeltaNet" in str(type(layer.self_attention)):
                 # GatedDeltaNet (linear attention) has in_proj, out_norm, out_proj
                 # instead of linear_qkv, q_layernorm, etc.
-                # Use dedicated GDN rules if available (no QKV slicing), else skip.
-                if "gated_delta_net_in_proj" in self.rules:
-                    self.rules["gated_delta_net_in_proj"](layer.self_attention.in_proj, layer_id)
+                required_rules = ["linear_proj", "gated_delta_net_in_proj"]
+                missing_rules = [r for r in required_rules if r not in self.rules]
+                if missing_rules:
+                    raise KeyError(
+                        f"Missing required export rule(s) for GatedDeltaNet at layer {layer_id}: {missing_rules}"
+                    )
+
+                self.rules["gated_delta_net_in_proj"](layer.self_attention.in_proj, layer_id)
                 if hasattr(layer.self_attention, "out_norm") and not isinstance(
                     layer.self_attention.out_norm, IdentityOp
                 ):
-                    if "gated_delta_net_out_norm" in self.rules:
-                        self.rules["gated_delta_net_out_norm"](layer.self_attention.out_norm, layer_id)
+                    if "gated_delta_net_out_norm" not in self.rules:
+                        raise KeyError(
+                            f"Missing required export rule 'gated_delta_net_out_norm' for GatedDeltaNet at layer {layer_id}"
+                        )
+                    self.rules["gated_delta_net_out_norm"](layer.self_attention.out_norm, layer_id)
                 self.rules["linear_proj"](layer.self_attention.out_proj, layer_id)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/unified_export_megatron.py` around lines 465 - 476, The
GatedDeltaNet branch can silently skip applying required rules and produce
incomplete checkpoints; update the handling in unified_export_megatron.py so
that when "GatedDeltaNet" is detected on layer.self_attention you assert/raise
an explicit error if the required rule keys are missing (e.g.,
"gated_delta_net_in_proj" and "linear_proj"), and also raise if out_norm exists
and is not IdentityOp but "gated_delta_net_out_norm" is absent; keep the
existing calls to
self.rules["gated_delta_net_in_proj"](layer.self_attention.in_proj, layer_id),
self.rules["gated_delta_net_out_norm"](layer.self_attention.out_norm, layer_id),
and self.rules["linear_proj"](layer.self_attention.out_proj, layer_id) but guard
them with clear exceptions mentioning the missing rule names.

Comment on lines +289 to +295
        # Handle empty tensors (e.g. from TP/EP sharding where this rank has no slice)
        if input.numel() == 0:
            return (
                cls(input_shape, input_dtype, input),
                torch.zeros(*input.shape[:-1], device=input.device, dtype=torch.float8_e4m3fn),
                torch.zeros(1, device=input.device, dtype=torch.float32),
            )
Contributor


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Locate reduce_amax/reduce_block_amax implementations =="
rg -n -C3 'def reduce_amax|def reduce_block_amax' modelopt/torch/quantization

echo
echo "== Inspect NVFP4 quantize callers and whether weights_scaling_factor_2 may be None =="
rg -n -C3 'NVFP4QTensor\.quantize\(' modelopt/torch

echo
echo "== Inspect keep_high_precision call sites/expectations =="
rg -n -C3 'keep_high_precision\s*=\s*True|keep_high_precision=True' modelopt/torch

echo
echo "== Inspect dequantize unpack assumptions (bitwise on quantized payload) =="
rg -n -C3 '_unpack_tensor|>> 4|& 0x0F' modelopt/torch/quantization/qtensor/nvfp4_tensor.py

Repository: NVIDIA/Model-Optimizer

Length of output: 2895


🏁 Script executed:

# Get the full quantize method signature and implementation
cat -n modelopt/torch/quantization/qtensor/nvfp4_tensor.py | sed -n '200,350p'

Repository: NVIDIA/Model-Optimizer

Length of output: 7218


🏁 Script executed:

# Search for the quantize method definition and look for keep_high_precision parameter
rg -n 'def quantize' modelopt/torch/quantization/qtensor/nvfp4_tensor.py -A 20

Repository: NVIDIA/Model-Optimizer

Length of output: 1081


🏁 Script executed:

# Check the dequantize implementation to understand what format it expects
rg -n 'def dequantize' modelopt/torch/quantization/qtensor/nvfp4_tensor.py -A 30

Repository: NVIDIA/Model-Optimizer

Length of output: 1538


🏁 Script executed:

# Check if keep_high_precision is used anywhere in the file
rg -n 'keep_high_precision' modelopt/torch/quantization/qtensor/nvfp4_tensor.py

Repository: NVIDIA/Model-Optimizer

Length of output: 583


🏁 Script executed:

# Verify the return type contract by checking what the caller expects
rg -n 'NVFP4QTensor.quantize' modelopt/torch/quantization/nn/modules/tensor_quantizer.py -B 5 -A 10

Repository: NVIDIA/Model-Optimizer

Length of output: 949


🏁 Script executed:

# Check if keep_high_precision=True is ever passed to quantize
rg -n 'keep_high_precision\s*=\s*True' modelopt/torch/quantization

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Examine get_weights_scaling_factor to understand return shape
rg -n 'def get_weights_scaling_factor' modelopt/torch/quantization/qtensor/nvfp4_tensor.py -A 20

Repository: NVIDIA/Model-Optimizer

Length of output: 3723


🏁 Script executed:

# Get the complete get_weights_scaling_factor method
sed -n '138,173p' modelopt/torch/quantization/qtensor/nvfp4_tensor.py

Repository: NVIDIA/Model-Optimizer

Length of output: 1548


🏁 Script executed:

# Also check reduce_block_amax to see what shape/dtype it produces
rg -n 'def reduce_block_amax' modelopt/torch/quantization/utils/core_utils.py -A 25

Repository: NVIDIA/Model-Optimizer

Length of output: 1198


🏁 Script executed:

# Check if the keep_high_precision return is ever consumed (any code testing the return type)
rg -n 'scaled_weight' modelopt/torch/quantization/qtensor/nvfp4_tensor.py

Repository: NVIDIA/Model-Optimizer

Length of output: 256


Empty-tensor fast path breaks NVFP4 quantization contract

The empty-tensor branch at lines 289-295 violates three critical contracts:

  1. Type contract violation: When keep_high_precision=True (line 228), the method returns a single tensor (line 310), but the empty branch always returns a 3-tuple. This causes a type mismatch if keep_high_precision=True is ever passed.

  2. Data format contract violation: Line 292 stores the padded float tensor directly as quantized payload, but dequantize() (lines 332-333) uses bitwise unpacking (>> 4 and & 0x0F) that expects packed uint8 data. Passing raw float data will cause unpacking to fail.

  3. Scale shape contract violation: The normal path produces per-block scales with shape (*input.shape[:-1], num_blocks) via reduce_block_amax(). Line 293 returns torch.zeros(*input.shape[:-1], ...) with shape (*input.shape[:-1],), omitting the block dimension. This breaks downstream code expecting per-block scales.

Additionally, the empty check at line 290 is placed too late—reductions at lines 251 and 285-286 execute on empty tensors before the early return, potentially producing NaN or 0 values.

Suggested fix (preserve contract for empty tensors)
+        # Handle empty tensors early (e.g. TP/EP ranks with no slice)
+        if input.numel() == 0:
+            if keep_high_precision:
+                return input
+
+            # Keep quantized contract: packed uint8 payload + per-block scales
+            packed_weight = torch.empty(
+                (*input.shape[:-1], input.shape[-1] // 2),
+                device=input.device,
+                dtype=torch.uint8,
+            )
+            per_block_scale = torch.empty(
+                (*input.shape[:-1], input.shape[-1] // block_size),
+                device=input.device,
+                dtype=torch.float8_e4m3fn,
+            )
+            if weights_scaling_factor_2 is None:
+                weights_scaling_factor_2 = torch.zeros(
+                    1, device=input.device, dtype=torch.float32
+                )
+            return (
+                cls(input_shape, input_dtype, packed_weight),
+                per_block_scale,
+                weights_scaling_factor_2,
+            )
+
         if weights_scaling_factor_2 is None:
             weights_scaling_factor_2 = cls.get_weights_scaling_factor_2(input)

         # try call trtllm fp4 quantization if possible
         if (
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/qtensor/nvfp4_tensor.py` around lines 289 - 295,
The empty-tensor branch must preserve the same return contract as the non-empty
path and avoid running reductions on empty inputs: move the early-return check
on input.numel() to before any calls to reduce_block_amax() or other reductions;
when empty, compute num_blocks = ceil_div(input.shape[-1], BLOCK_SIZE) (same
logic used elsewhere) and return the same tuple structure as the non-empty
paths—i.e., if keep_high_precision is True return the high-precision payload
only (matching the branch that returns at line ~310), otherwise return
(cls(input_shape, input_dtype, input), a properly-shaped packed uint8 payload
filled with zeros matching the packed layout dequantize() expects, and a scale
tensor shaped (*input.shape[:-1], num_blocks) filled with zeros); ensure the
payload dtype/format matches dequantize()'s bitwise unpacking (packed uint8 with
low/high nibbles), and use the same device/dtype conventions as
reduce_block_amax() and dequantize().

@lennytinkeredapps lennytinkeredapps force-pushed the fix-qwen3x-moe-nvfp4-export branch from a12351d to f1e2944 Compare April 29, 2026 06:17
Three bugs fixed for multi-rank EP MoE export (Qwen3.6-35B-A3B):

1. Format string bug: fc1/fc2 prefix has two {} placeholders (layer_id, expert_id).
   Using .format(layer_id) fails. Fixed with re.sub to fill only first {}.

2. Expert offset bug: _grouped_mlp_slicing had no EP rank awareness. Both ranks
   wrote experts 0-127 with overlapping keys. Added expert_offset param from
   get_expert_model_parallel_rank() * num_local_experts.

3. weight_key bug: used global expert_id for module lookup instead of local_expert_id.
   Module has weight0..weight127, not weight128..weight255.

4. Save strategy: all_gather_object causes OOM (pickle overhead on ~40k tensors).
   Each rank now writes to separate NFS dir, then rank 0 merges safetensors
   shard-by-shard with low memory footprint.
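Items 1 and 2 can be sketched in a few lines. The template string below is illustrative (the real fc1/fc2 prefixes live in the mcore export mappings), but the failure mode and the fix are exactly as described:

```python
import re

# Two {} placeholders: .format(layer_id) alone raises IndexError.
prefix = "decoder.layers.{}.mlp.experts.linear_fc1.weight{}"
try:
    prefix.format(3)
    raise AssertionError("expected IndexError")
except IndexError:
    pass  # "Replacement index 1 out of range"

# Fill only the first {} and leave the expert slot for later:
partly = re.sub(r"\{\}", "3", prefix, count=1)
assert partly == "decoder.layers.3.mlp.experts.linear_fc1.weight{}"

# Item 2's per-rank expert offset (hypothetical EP-2 layout for 256 experts):
num_local_experts, ep_rank = 128, 1
expert_offset = ep_rank * num_local_experts  # rank 1 owns global experts 128..255
assert expert_offset == 128
```

Item 3 follows directly: module lookups must use the local index (expert_id - expert_offset), since each rank only holds weight0..weight127.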
… dispatch

- Add GroupedGatedMLPSlicing class to mcore_custom.py for TEGroupedMLP gate/up split
- Add _grouped_gated_mlp_slicing method to GPTModelExporter
- Clone shared-storage tensors before safetensors save (NVFP4 weight_scale broadcast)
- Dispatch fc1 slicing based on mapping func_name for correct expert handling
…n_hidden_size

Both GatedMLPSlicing and GroupedGatedMLPSlicing used module.config.ffn_hidden_size
to determine the gate/up split point. For MoE models, ffn_hidden_size is often set
to hidden_size (2048) rather than the per-expert intermediate size (512), causing
gate_proj to receive the full fused weight and up_proj to be empty [0, N].

Now derives gated_split from the actual weight tensor shape (rows // 2).
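A minimal sketch of the shape-derived split, with illustrative Qwen3.6-style sizes (hidden_size 2048, per-expert intermediate 512) rather than values read from a real config:

```python
def gated_split(fused_rows: int) -> int:
    # The fused fc1 weight stacks gate_proj and up_proj along dim 0,
    # so the split point is simply half the row count, regardless of
    # what config.ffn_hidden_size reports.
    return fused_rows // 2

hidden_size, expert_intermediate = 2048, 512
fused_rows = 2 * expert_intermediate        # 1024 rows in the fused weight

split = gated_split(fused_rows)
assert split == expert_intermediate          # 512, not hidden_size (2048)
```

Deriving the split from the tensor itself keeps gate_proj and up_proj equal-sized even when ffn_hidden_size is repurposed, avoiding the empty [0, N] up_proj described above.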
