Fix NVFP4 quantization for Qwen3.x MoE models (4 silent-failure bugs)#1323
erictinkeredapps wants to merge 17 commits into NVIDIA:main
Conversation
Four bugs prevent NVFP4 export from producing quantized weights for
Qwen3.5/3.6 MoE models (and potentially other fused MoE architectures).
All produce silent failures — no errors, just bfloat16 output identical
to input.
Bug 1: is_multimodal_model() crashes when config.architectures is None
- model_utils.py: add 'or []' fallback for NoneType iteration
Bug 3: get_quantization_format() doesn't recognize _QuantFusedExperts
- quant_utils.py: add check for plural ModuleList quantizers
(gate_up_proj_weight_quantizers, down_proj_weight_quantizers)
before the singular weight_quantizer loop
Bug 4: NVFP4 config wildcards don't match plural quantizer names
- config.py: _nvfp4_selective_quant_cfg() only generates patterns
for singular 'weight_quantizer', but _QuantFusedExperts creates
plural ModuleList quantizers. Add wildcard entries for both
gate_up_proj_weight_quantizers* and down_proj_weight_quantizers*
Bug 5: _process_quantized_modules elif order sends fused MoE to wrong path
- unified_export_hf.py: swap elif branches so hasattr check for
gate_up_proj_weight_quantizers comes before type-name checks.
Without this, QuantQwen3_5MoeExperts hits the singular-attribute
branch and crashes with AttributeError
Tested on: Qwen3.6-35B-A3B (MoE), NVIDIA DGX Spark (GB10),
modelopt 0.45.0 dev, transformers 5.5.4
Output: 20.5 GB NVFP4 (down from 66 GB bfloat16)
Note: Reviews paused. This branch is under active development, so CodeRabbit has automatically paused this review to avoid an influx of review comments on new commits.
📝 Walkthrough

Updates export/quantization to detect and handle fused Mixture-of-Experts (MoE) modules: normalizes multimodal detection, adds plural-quantizer NVFP4 detection, reorders fused-expert export, extends NVFP4 config patterns, and registers a Megatron↔HF mapping.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Exporter as Exporter
    participant Scanner as Module Scanner
    participant QuantDetector as Quant format detector
    participant FusedExporter as Fused Experts Exporter
    participant NVFP4Cfg as NVFP4 Config Generator
    Exporter->>Scanner: enumerate sub-modules
    Scanner->>QuantDetector: inspect module attributes
    alt module has plural quantizers
        QuantDetector->>QuantDetector: pick enabled/first quantizer, read num_bits & scale_bits
        QuantDetector-->>Exporter: return QUANTIZATION_NVFP4
        Exporter->>FusedExporter: call _export_fused_experts(..., reshard=False)
    else other quantized module
        QuantDetector-->>Exporter: return other quant format
        Exporter->>Exporter: follow existing export branches
    end
    Exporter->>NVFP4Cfg: request selective NVFP4 patterns
    NVFP4Cfg-->>Exporter: include plural-quantizer wildcard patterns
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ✅ 5 passed | ❌ 1 failed (1 warning)
Comment
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/export/quant_utils.py`:
- Around line 673-685: The code currently only inspects quantizer_list[0] to
detect NVFP4, which misses cases where expert 0 is disabled; update the logic in
the detection block to iterate over quantizer_list and find the first quantizer
q where hasattr(q, "is_enabled") and q.is_enabled (or otherwise any enabled
quantizer), then read num_bits, block_sizes and compute scale_bits from that
enabled q and return QUANTIZATION_NVFP4 when matching (num_bits == (2, 1) and
scale_bits == (4, 3)); ensure the fallback to QUANTIZATION_NONE only happens
after checking all quantizers.
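A minimal sketch of the iteration being asked for, assuming each quantizer exposes `is_enabled`, `num_bits`, and a `block_sizes` dict carrying `scale_bits` as the comment describes (the helper name and returned strings are placeholders, not the actual quant_utils.py code):

```python
def _detect_nvfp4_from_quantizers(quantizer_list):
    # Walk all quantizers instead of trusting index 0, since expert 0 may be
    # disabled when it was never calibrated.
    for q in quantizer_list:
        if not getattr(q, "is_enabled", False):
            continue
        num_bits = getattr(q, "num_bits", None)
        block_sizes = getattr(q, "block_sizes", None) or {}
        if num_bits == (2, 1) and block_sizes.get("scale_bits") == (4, 3):
            return "QUANTIZATION_NVFP4"  # placeholder for the real constant
    # Fall back to "none" only after every quantizer has been checked.
    return "QUANTIZATION_NONE"  # placeholder for the real constant
```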
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: f0da2041-19c7-4d5f-b7a8-ecfb8c942950
📒 Files selected for processing (4)
- modelopt/torch/export/model_utils.py
- modelopt/torch/export/quant_utils.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/config.py
CodeRabbit review: expert 0 may be disabled when uncalibrated, so checking only quantizer_list[0] can miss the actual NVFP4 config. Now iterates to find the first enabled quantizer in the list.
…nce check
- Add Qwen3_5MoeForConditionalGeneration to export/import mappings
- Add Qwen3VLModel + HybridModel to GPTModelExporter isinstance check
- Handle GatedDeltaNet layers in _get_transformer_layer_state_dict
- Fix quantizer format detection for disabled quantizers
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/export/unified_export_megatron.py (1)
465-494: ⚠️ Potential issue | 🟠 Major

Guard the new `softmax_offset` export path.

This branch indexes `self.rules["softmax_offset"]` unconditionally once the attribute exists. If the active rule book does not define that key for a supported attention variant, export will raise `KeyError` even though the rest of the path is valid.

🔧 Suggested fix
```diff
-            if hasattr(layer.self_attention, "core_attention") and getattr(layer.self_attention.core_attention, "softmax_offset", None) is not None:
-                self.rules["softmax_offset"](
+            if (
+                "softmax_offset" in self.rules
+                and hasattr(layer.self_attention, "core_attention")
+                and getattr(layer.self_attention.core_attention, "softmax_offset", None) is not None
+            ):
+                self.rules["softmax_offset"](
                     layer.self_attention.core_attention.softmax_offset, layer_id
                 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/unified_export_megatron.py` around lines 465 - 494, The code unconditionally accesses self.rules["softmax_offset"] when layer.self_attention.core_attention has softmax_offset, which can raise KeyError if the active rule book lacks that key; update the final branch that currently checks getattr(layer.self_attention.core_attention, "softmax_offset", None) to also verify "softmax_offset" in self.rules (and optionally that it's callable) before invoking self.rules["softmax_offset"](layer.self_attention.core_attention.softmax_offset, layer_id), keeping the existing hasattr/getattr checks for core_attention/softmax_offset and referencing the symbols self.rules, "softmax_offset", layer.self_attention.core_attention, and layer_id.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 465-494: The code unconditionally accesses
self.rules["softmax_offset"] when layer.self_attention.core_attention has
softmax_offset, which can raise KeyError if the active rule book lacks that key;
update the final branch that currently checks
getattr(layer.self_attention.core_attention, "softmax_offset", None) to also
verify "softmax_offset" in self.rules (and optionally that it's callable) before
invoking
self.rules["softmax_offset"](layer.self_attention.core_attention.softmax_offset,
layer_id), keeping the existing hasattr/getattr checks for
core_attention/softmax_offset and referencing the symbols self.rules,
"softmax_offset", layer.self_attention.core_attention, and layer_id.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 87ec7afa-9907-4194-886c-93fccd2f666d
📒 Files selected for processing (3)
- modelopt/torch/export/plugins/mcore_common.py
- modelopt/torch/export/quant_utils.py
- modelopt/torch/export/unified_export_megatron.py
✅ Files skipped from review due to trivial changes (1)
- modelopt/torch/export/plugins/mcore_common.py
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 465-476: The GatedDeltaNet branch can silently skip applying
required rules and produce incomplete checkpoints; update the handling in
unified_export_megatron.py so that when "GatedDeltaNet" is detected on
layer.self_attention you assert/raise an explicit error if the required rule
keys are missing (e.g., "gated_delta_net_in_proj" and "linear_proj"), and also
raise if out_norm exists and is not IdentityOp but "gated_delta_net_out_norm" is
absent; keep the existing calls to
self.rules["gated_delta_net_in_proj"](layer.self_attention.in_proj, layer_id),
self.rules["gated_delta_net_out_norm"](layer.self_attention.out_norm, layer_id),
and self.rules["linear_proj"](layer.self_attention.out_proj, layer_id) but guard
them with clear exceptions mentioning the missing rule names.
- Line 126: The multimodal detection currently only checks for LLaVAModel,
causing Qwen3VLModel to be treated like a unimodal model; update the two
is_multimodal checks that reference LLaVAModel so they also include Qwen3VLModel
(so Qwen3VLModel is detected as multimodal), ensuring the export path uses
model.language_model where intended and that multimodal components are loaded
(the same checks that gate using model.language_model and loading multimodal
parts should include Qwen3VLModel).
In `@modelopt/torch/quantization/qtensor/nvfp4_tensor.py`:
- Around line 289-295: The empty-tensor branch must preserve the same return
contract as the non-empty path and avoid running reductions on empty inputs:
move the early-return check on input.numel() to before any calls to
reduce_block_amax() or other reductions; when empty, compute num_blocks =
ceil_div(input.shape[-1], BLOCK_SIZE) (same logic used elsewhere) and return the
same tuple structure as the non-empty paths—i.e., if keep_high_precision is True
return the high-precision payload only (matching the branch that returns at line
~310), otherwise return (cls(input_shape, input_dtype, input), a properly-shaped
packed uint8 payload filled with zeros matching the packed layout dequantize()
expects, and a scale tensor shaped (*input.shape[:-1], num_blocks) filled with
zeros); ensure the payload dtype/format matches dequantize()'s bitwise unpacking
(packed uint8 with low/high nibbles), and use the same device/dtype conventions
as reduce_block_amax() and dequantize().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a211fed2-1afc-4bf1-b1a1-bcd8e4e8a759
📒 Files selected for processing (3)
- modelopt/torch/export/plugins/mcore_qwen.py
- modelopt/torch/export/unified_export_megatron.py
- modelopt/torch/quantization/qtensor/nvfp4_tensor.py
```diff
     ):
         """Create a GPTModel exporter instance."""
-        if not isinstance(model, (GPTModel, MambaModel, LLaVAModel)):
+        if not isinstance(model, (GPTModel, MambaModel, HybridModel, LLaVAModel, Qwen3VLModel)):
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Verify accepted model classes vs multimodal routing logic in this exporter.
rg -n -C2 'Qwen3VLModel|isinstance\(model, \(GPTModel, MambaModel, HybridModel, LLaVAModel, Qwen3VLModel\)\)|is_multimodal = isinstance\(model, LLaVAModel\)|self\.model = model\.language_model if self\.is_multimodal else model|load_multimodal_components' modelopt/torch/export/unified_export_megatron.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1923
Qwen3VLModel is accepted but not routed through multimodal export handling.
Line 126 accepts Qwen3VLModel in the isinstance check, but multimodal detection at lines 151 and 157 only checks for LLaVAModel. This causes Qwen3VLModel to skip multimodal export logic: line 158 will use the full model instead of model.language_model, and line 349 will not load multimodal components. The export will be incomplete/incorrect.
Update both is_multimodal checks to include Qwen3VLModel:
Suggested fix
```diff
-        self.is_multimodal = isinstance(model, LLaVAModel)
+        self.is_multimodal = isinstance(model, (LLaVAModel, Qwen3VLModel))
```

Apply this change at both line 151 and line 157.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/export/unified_export_megatron.py` at line 126, The multimodal
detection currently only checks for LLaVAModel, causing Qwen3VLModel to be
treated like a unimodal model; update the two is_multimodal checks that
reference LLaVAModel so they also include Qwen3VLModel (so Qwen3VLModel is
detected as multimodal), ensuring the export path uses model.language_model
where intended and that multimodal components are loaded (the same checks that
gate using model.language_model and loading multimodal parts should include
Qwen3VLModel).
| elif "GatedDeltaNet" in str(type(layer.self_attention)): | ||
| # GatedDeltaNet (linear attention) has in_proj, out_norm, out_proj | ||
| # instead of linear_qkv, q_layernorm, etc. | ||
| # Use dedicated GDN rules if available (no QKV slicing), else skip. | ||
| if "gated_delta_net_in_proj" in self.rules: | ||
| self.rules["gated_delta_net_in_proj"](layer.self_attention.in_proj, layer_id) | ||
| if hasattr(layer.self_attention, "out_norm") and not isinstance( | ||
| layer.self_attention.out_norm, IdentityOp | ||
| ): | ||
| if "gated_delta_net_out_norm" in self.rules: | ||
| self.rules["gated_delta_net_out_norm"](layer.self_attention.out_norm, layer_id) | ||
| self.rules["linear_proj"](layer.self_attention.out_proj, layer_id) |
Fail fast when required GatedDeltaNet rules are missing.
This branch currently skips in_proj / out_norm when rule keys are absent, which can silently emit incomplete checkpoints. Please raise an explicit error instead of skipping.
Suggested fix
elif "GatedDeltaNet" in str(type(layer.self_attention)):
# GatedDeltaNet (linear attention) has in_proj, out_norm, out_proj
# instead of linear_qkv, q_layernorm, etc.
- # Use dedicated GDN rules if available (no QKV slicing), else skip.
- if "gated_delta_net_in_proj" in self.rules:
- self.rules["gated_delta_net_in_proj"](layer.self_attention.in_proj, layer_id)
+ required_rules = ["linear_proj", "gated_delta_net_in_proj"]
+ missing_rules = [r for r in required_rules if r not in self.rules]
+ if missing_rules:
+ raise KeyError(
+ f"Missing required export rule(s) for GatedDeltaNet at layer {layer_id}: {missing_rules}"
+ )
+
+ self.rules["gated_delta_net_in_proj"](layer.self_attention.in_proj, layer_id)
if hasattr(layer.self_attention, "out_norm") and not isinstance(
layer.self_attention.out_norm, IdentityOp
):
- if "gated_delta_net_out_norm" in self.rules:
- self.rules["gated_delta_net_out_norm"](layer.self_attention.out_norm, layer_id)
+ if "gated_delta_net_out_norm" not in self.rules:
+ raise KeyError(
+ f"Missing required export rule 'gated_delta_net_out_norm' for GatedDeltaNet at layer {layer_id}"
+ )
+ self.rules["gated_delta_net_out_norm"](layer.self_attention.out_norm, layer_id)
self.rules["linear_proj"](layer.self_attention.out_proj, layer_id)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/export/unified_export_megatron.py` around lines 465 - 476, The
GatedDeltaNet branch can silently skip applying required rules and produce
incomplete checkpoints; update the handling in unified_export_megatron.py so
that when "GatedDeltaNet" is detected on layer.self_attention you assert/raise
an explicit error if the required rule keys are missing (e.g.,
"gated_delta_net_in_proj" and "linear_proj"), and also raise if out_norm exists
and is not IdentityOp but "gated_delta_net_out_norm" is absent; keep the
existing calls to
self.rules["gated_delta_net_in_proj"](layer.self_attention.in_proj, layer_id),
self.rules["gated_delta_net_out_norm"](layer.self_attention.out_norm, layer_id),
and self.rules["linear_proj"](layer.self_attention.out_proj, layer_id) but guard
them with clear exceptions mentioning the missing rule names.
```python
        # Handle empty tensors (e.g. from TP/EP sharding where this rank has no slice)
        if input.numel() == 0:
            return (
                cls(input_shape, input_dtype, input),
                torch.zeros(*input.shape[:-1], device=input.device, dtype=torch.float8_e4m3fn),
                torch.zeros(1, device=input.device, dtype=torch.float32),
            )
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "== Locate reduce_amax/reduce_block_amax implementations =="
rg -n -C3 'def reduce_amax|def reduce_block_amax' modelopt/torch/quantization
echo
echo "== Inspect NVFP4 quantize callers and whether weights_scaling_factor_2 may be None =="
rg -n -C3 'NVFP4QTensor\.quantize\(' modelopt/torch
echo
echo "== Inspect keep_high_precision call sites/expectations =="
rg -n -C3 'keep_high_precision\s*=\s*True|keep_high_precision=True' modelopt/torch
echo
echo "== Inspect dequantize unpack assumptions (bitwise on quantized payload) =="
rg -n -C3 '_unpack_tensor|>> 4|& 0x0F' modelopt/torch/quantization/qtensor/nvfp4_tensor.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 2895
🏁 Script executed:
```bash
# Get the full quantize method signature and implementation
cat -n modelopt/torch/quantization/qtensor/nvfp4_tensor.py | sed -n '200,350p'
```

Repository: NVIDIA/Model-Optimizer
Length of output: 7218
🏁 Script executed:
```bash
# Search for the quantize method definition and look for keep_high_precision parameter
rg -n 'def quantize' modelopt/torch/quantization/qtensor/nvfp4_tensor.py -A 20
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1081
🏁 Script executed:
```bash
# Check the dequantize implementation to understand what format it expects
rg -n 'def dequantize' modelopt/torch/quantization/qtensor/nvfp4_tensor.py -A 30
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1538
🏁 Script executed:
```bash
# Check if keep_high_precision is used anywhere in the file
rg -n 'keep_high_precision' modelopt/torch/quantization/qtensor/nvfp4_tensor.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 583
🏁 Script executed:
```bash
# Verify the return type contract by checking what the caller expects
rg -n 'NVFP4QTensor.quantize' modelopt/torch/quantization/nn/modules/tensor_quantizer.py -B 5 -A 10
```

Repository: NVIDIA/Model-Optimizer
Length of output: 949
🏁 Script executed:
```bash
# Check if keep_high_precision=True is ever passed to quantize
rg -n 'keep_high_precision\s*=\s*True' modelopt/torch/quantization
```

Repository: NVIDIA/Model-Optimizer
Length of output: 48
🏁 Script executed:
```bash
# Examine get_weights_scaling_factor to understand return shape
rg -n 'def get_weights_scaling_factor' modelopt/torch/quantization/qtensor/nvfp4_tensor.py -A 20
```

Repository: NVIDIA/Model-Optimizer
Length of output: 3723
🏁 Script executed:
```bash
# Get the complete get_weights_scaling_factor method
sed -n '138,173p' modelopt/torch/quantization/qtensor/nvfp4_tensor.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1548
🏁 Script executed:
```bash
# Also check reduce_block_amax to see what shape/dtype it produces
rg -n 'def reduce_block_amax' modelopt/torch/quantization/utils/core_utils.py -A 25
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1198
🏁 Script executed:
```bash
# Check if the keep_high_precision return is ever consumed (any code testing the return type)
rg -n 'scaled_weight' modelopt/torch/quantization/qtensor/nvfp4_tensor.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 256
Empty-tensor fast path breaks NVFP4 quantization contract
The empty-tensor branch at lines 289-295 violates three critical contracts:
1. Type contract violation: When `keep_high_precision=True` (line 228), the method returns a single tensor (line 310), but the empty branch always returns a 3-tuple. This causes a type mismatch if `keep_high_precision=True` is ever passed.
2. Data format contract violation: Line 292 stores the padded float tensor directly as the quantized payload, but `dequantize()` (lines 332-333) uses bitwise unpacking (`>> 4` and `& 0x0F`) that expects packed uint8 data. Passing raw float data will cause unpacking to fail.
3. Scale shape contract violation: The normal path produces per-block scales with shape `(*input.shape[:-1], num_blocks)` via `reduce_block_amax()`. Line 293 returns `torch.zeros(*input.shape[:-1], ...)` with shape `(*input.shape[:-1],)`, omitting the block dimension. This breaks downstream code expecting per-block scales.
Additionally, the empty check at line 290 is placed too late—reductions at lines 251 and 285-286 execute on empty tensors before the early return, potentially producing NaN or 0 values.
Suggested fix (preserve contract for empty tensors)
```diff
+        # Handle empty tensors early (e.g. TP/EP ranks with no slice)
+        if input.numel() == 0:
+            if keep_high_precision:
+                return input
+
+            # Keep quantized contract: packed uint8 payload + per-block scales
+            packed_weight = torch.empty(
+                (*input.shape[:-1], input.shape[-1] // 2),
+                device=input.device,
+                dtype=torch.uint8,
+            )
+            per_block_scale = torch.empty(
+                (*input.shape[:-1], input.shape[-1] // block_size),
+                device=input.device,
+                dtype=torch.float8_e4m3fn,
+            )
+            if weights_scaling_factor_2 is None:
+                weights_scaling_factor_2 = torch.zeros(
+                    1, device=input.device, dtype=torch.float32
+                )
+            return (
+                cls(input_shape, input_dtype, packed_weight),
+                per_block_scale,
+                weights_scaling_factor_2,
+            )
+
         if weights_scaling_factor_2 is None:
             weights_scaling_factor_2 = cls.get_weights_scaling_factor_2(input)
         # try call trtllm fp4 quantization if possible
         if (
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/qtensor/nvfp4_tensor.py` around lines 289 - 295,
The empty-tensor branch must preserve the same return contract as the non-empty
path and avoid running reductions on empty inputs: move the early-return check
on input.numel() to before any calls to reduce_block_amax() or other reductions;
when empty, compute num_blocks = ceil_div(input.shape[-1], BLOCK_SIZE) (same
logic used elsewhere) and return the same tuple structure as the non-empty
paths—i.e., if keep_high_precision is True return the high-precision payload
only (matching the branch that returns at line ~310), otherwise return
(cls(input_shape, input_dtype, input), a properly-shaped packed uint8 payload
filled with zeros matching the packed layout dequantize() expects, and a scale
tensor shaped (*input.shape[:-1], num_blocks) filled with zeros); ensure the
payload dtype/format matches dequantize()'s bitwise unpacking (packed uint8 with
low/high nibbles), and use the same device/dtype conventions as
reduce_block_amax() and dequantize().
With EP=2, local_experts are indexed 0..127 per rank but global IDs must account for EP rank. rank 0 → 0-127, rank 1 → 128-255.
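A small arithmetic sketch of that mapping (function and variable names are illustrative, not the exporter's actual API):

```python
def global_expert_id(ep_rank: int, num_local_experts: int, local_expert_id: int) -> int:
    # Each EP rank owns a contiguous slice of experts; offset by rank * slice size.
    return ep_rank * num_local_experts + local_expert_id

assert global_expert_id(0, 128, 0) == 0       # rank 0 covers experts 0-127
assert global_expert_id(0, 128, 127) == 127
assert global_expert_id(1, 128, 0) == 128     # rank 1 covers experts 128-255
assert global_expert_id(1, 128, 127) == 255
```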
a12351d to f1e2944
Three bugs fixed for multi-rank EP MoE export (Qwen3.6-35B-A3B), plus a save-strategy change:
1. Format string bug: fc1/fc2 prefix has two {} placeholders (layer_id, expert_id).
Using .format(layer_id) fails. Fixed with re.sub to fill only the first {} (see the sketch after this list).
2. Expert offset bug: _grouped_mlp_slicing had no EP rank awareness. Both ranks
wrote experts 0-127 with overlapping keys. Added expert_offset param from
get_expert_model_parallel_rank() * num_local_experts.
3. weight_key bug: used global expert_id for module lookup instead of local_expert_id.
Module has weight0..weight127, not weight128..weight255.
4. Save strategy: all_gather_object causes OOM (pickle overhead on ~40k tensors).
Each rank now writes to separate NFS dir, then rank 0 merges safetensors
shard-by-shard with low memory footprint.
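Sketch of the fix described in item 1, filling only the first `{}` of a two-placeholder prefix; the prefix string below is a made-up example, not the actual mapping entry:

```python
import re

fc1_prefix = "decoder.layers.{}.mlp.experts.{}.linear_fc1"  # hypothetical two-placeholder prefix
layer_id = 3

# str.format(layer_id) raises IndexError because the second {} has no argument,
# so substitute only the first occurrence and leave the expert placeholder intact.
prefix_with_layer = re.sub(r"\{\}", str(layer_id), fc1_prefix, count=1)
print(prefix_with_layer)  # decoder.layers.3.mlp.experts.{}.linear_fc1
```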
… dispatch
- Add GroupedGatedMLPSlicing class to mcore_custom.py for TEGroupedMLP gate/up split
- Add _grouped_gated_mlp_slicing method to GPTModelExporter
- Clone shared-storage tensors before safetensors save (NVFP4 weight_scale broadcast)
- Dispatch fc1 slicing based on mapping func_name for correct expert handling
…n_hidden_size
Both GatedMLPSlicing and GroupedGatedMLPSlicing used module.config.ffn_hidden_size to determine the gate/up split point. For MoE models, ffn_hidden_size is often set to hidden_size (2048) rather than the per-expert intermediate size (512), causing gate_proj to receive the full fused weight and up_proj to be empty [0, N]. Now derives gated_split from the actual weight tensor shape (rows // 2).
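A toy illustration of deriving the split from the fused tensor rather than from `config.ffn_hidden_size`, assuming the gate rows are stacked above the up rows (shapes mirror the 512/2048 example in the message):

```python
import torch

fused_fc1 = torch.randn(1024, 2048)        # 2 * 512 gate/up rows, hidden_size columns
gated_split = fused_fc1.shape[0] // 2      # rows // 2 instead of config.ffn_hidden_size
gate_proj, up_proj = fused_fc1[:gated_split], fused_fc1[gated_split:]
assert gate_proj.shape == up_proj.shape == (512, 2048)
```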
Summary
Four bugs prevent NVFP4 quantization from producing quantized weights for Qwen3.5/3.6 MoE models (and likely other fused MoE architectures using `_QuantFusedExperts`). All four produce silent failures: no errors, just bfloat16 output identical to the input model.

Test Environment

Qwen3.6-35B-A3B (MoE) on NVIDIA DGX Spark (GB10), modelopt 0.45.0 dev, transformers 5.5.4. Output: 20.5 GB NVFP4, down from 66 GB bfloat16.
Bug Details
Bug 1: `is_multimodal_model()` crashes on `None` architectures

File: `modelopt/torch/export/model_utils.py`

Models with `config.architectures = None` (common for fine-tuned checkpoints) crash when `is_multimodal_model()` iterates the list. One-line fix: `or []` fallback.
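A minimal sketch of the fallback pattern; the predicate body is a placeholder, not the actual model_utils.py logic:

```python
def is_multimodal_model(config) -> bool:
    # config.architectures can be None on some fine-tuned checkpoints;
    # "or []" turns that into an empty, safely iterable list.
    architectures = getattr(config, "architectures", None) or []
    return any("VL" in arch for arch in architectures)  # placeholder heuristic for illustration
```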
Bug 2: (Usage issue, not a code bug — fixed in caller)

Bug 3: `get_quantization_format()` does not recognize `_QuantFusedExperts`

File: `modelopt/torch/export/quant_utils.py`

The function iterates `weight_attr_names(module)`, which returns singular attribute names. `_QuantFusedExperts` modules use plural ModuleList quantizers (`gate_up_proj_weight_quantizers.N`), so the function returns `None` and the module is treated as unquantized. Added a pre-check for plural ModuleList quantizers before the singular loop.
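Sketch of the kind of pre-check described, using the plural attribute names from this PR; the helper name and return convention are assumptions:

```python
import torch.nn as nn

_PLURAL_QUANTIZER_ATTRS = ("gate_up_proj_weight_quantizers", "down_proj_weight_quantizers")

def _has_fused_expert_quantizers(module) -> bool:
    # _QuantFusedExperts keeps one quantizer per expert in a ModuleList
    # (gate_up_proj_weight_quantizers.0, .1, ...), which the singular
    # weight_quantizer loop never visits.
    return any(
        isinstance(getattr(module, attr, None), nn.ModuleList)
        for attr in _PLURAL_QUANTIZER_ATTRS
    )
```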
Bug 4: NVFP4 config wildcards do not match plural quantizer names

File: `modelopt/torch/quantization/config.py`

`_nvfp4_selective_quant_cfg()` generates patterns like `*mlp.experts*weight_quantizer` (singular). `_QuantFusedExperts` creates quantizers named `gate_up_proj_weight_quantizers.0` (plural + index). The `fnmatch` fails, the quantizers never receive the NVFP4 config, and 100% stay at the disabled default. Added wildcard entries for both plural suffix patterns.
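A quick check of why the singular wildcard misses the plural names (the fully qualified quantizer name below is an assumed example):

```python
from fnmatch import fnmatch

name = "model.layers.0.mlp.experts.gate_up_proj_weight_quantizers.0"

print(fnmatch(name, "*mlp.experts*weight_quantizer"))      # False: pattern requires a singular suffix
print(fnmatch(name, "*gate_up_proj_weight_quantizers*"))   # True: plural wildcard added by this PR
```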
Bug 5: `_process_quantized_modules` elif order sends fused MoE to wrong export path

File: `modelopt/torch/export/unified_export_hf.py`

Two elif branches: one checks the type name (`"Llama4TextExperts" in type().__name__`), the other checks `hasattr("gate_up_proj_weight_quantizers")`. After `_QuantFusedExperts` wrapping, `QuantQwen3_5MoeExperts` matches the type-name branch, which calls `_export_quantized_weight()` looking for singular attributes → `AttributeError`. Swapped the elif order so the plural-attribute check runs first.
Changes

- `modelopt/torch/export/model_utils.py`: `or []` fallback
- `modelopt/torch/export/quant_utils.py`
- `modelopt/torch/export/unified_export_hf.py`
- `modelopt/torch/quantization/config.py`

Summary by CodeRabbit
Bug Fixes
New Features