Fix NVFP4 quantization for Qwen3.x MoE models (4 silent-failure bugs)#1323
erictinkeredapps wants to merge 17 commits into NVIDIA:main
Conversation
Four bugs prevent NVFP4 export from producing quantized weights for
Qwen3.5/3.6 MoE models (and potentially other fused MoE architectures).
All produce silent failures — no errors, just bfloat16 output identical
to input.
Bug 1: is_multimodal_model() crashes when config.architectures is None
- model_utils.py: add 'or []' fallback for NoneType iteration
Bug 3: get_quantization_format() doesn't recognize _QuantFusedExperts
- quant_utils.py: add check for plural ModuleList quantizers
(gate_up_proj_weight_quantizers, down_proj_weight_quantizers)
before the singular weight_quantizer loop
Bug 4: NVFP4 config wildcards don't match plural quantizer names
- config.py: _nvfp4_selective_quant_cfg() only generates patterns
for singular 'weight_quantizer', but _QuantFusedExperts creates
plural ModuleList quantizers. Add wildcard entries for both
gate_up_proj_weight_quantizers* and down_proj_weight_quantizers*
Bug 5: _process_quantized_modules elif order sends fused MoE to wrong path
- unified_export_hf.py: swap elif branches so hasattr check for
gate_up_proj_weight_quantizers comes before type-name checks.
Without this, QuantQwen3_5MoeExperts hits the singular-attribute
branch and crashes with AttributeError
Tested on: Qwen3.6-35B-A3B (MoE), NVIDIA DGX Spark (GB10),
modelopt 0.45.0 dev, transformers 5.5.4
Output: 20.5 GB NVFP4 (down from 66 GB bfloat16)
Note: Reviews paused. This branch is under active development, so CodeRabbit has automatically paused this review to avoid an influx of review comments on new commits.
📝 Walkthrough

Updates export/quantization to detect and handle fused Mixture-of-Experts (MoE) modules: normalizes multimodal detection, adds plural-quantizer NVFP4 detection, reorders fused-expert export, extends NVFP4 config patterns, and registers a Megatron↔HF mapping.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Exporter as Exporter
    participant Scanner as Module Scanner
    participant QuantDetector as Quant format detector
    participant FusedExporter as Fused Experts Exporter
    participant NVFP4Cfg as NVFP4 Config Generator
    Exporter->>Scanner: enumerate sub-modules
    Scanner->>QuantDetector: inspect module attributes
    alt module has plural quantizers
        QuantDetector->>QuantDetector: pick enabled/first quantizer, read num_bits & scale_bits
        QuantDetector-->>Exporter: return QUANTIZATION_NVFP4
        Exporter->>FusedExporter: call _export_fused_experts(..., reshard=False)
    else other quantized module
        QuantDetector-->>Exporter: return other quant format
        Exporter->>Exporter: follow existing export branches
    end
    Exporter->>NVFP4Cfg: request selective NVFP4 patterns
    NVFP4Cfg-->>Exporter: include plural-quantizer wildcard patterns
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ✅ 5 passed | ❌ 1 failed (1 warning)
Comment
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/export/quant_utils.py`:
- Around line 673-685: The code currently only inspects quantizer_list[0] to
detect NVFP4, which misses cases where expert 0 is disabled; update the logic in
the detection block to iterate over quantizer_list and find the first quantizer
q where hasattr(q, "is_enabled") and q.is_enabled (or otherwise any enabled
quantizer), then read num_bits, block_sizes and compute scale_bits from that
enabled q and return QUANTIZATION_NVFP4 when matching (num_bits == (2, 1) and
scale_bits == (4, 3)); ensure the fallback to QUANTIZATION_NONE only happens
after checking all quantizers.
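A minimal sketch of the iteration being asked for, assuming each quantizer exposes `is_enabled`, `num_bits`, and a `block_sizes` dict carrying `scale_bits` as the comment describes (the helper name and returned strings are placeholders, not the actual quant_utils.py code):

```python
def _detect_nvfp4_from_quantizers(quantizer_list):
    # Walk all quantizers instead of trusting index 0, since expert 0 may be
    # disabled when it was never calibrated.
    for q in quantizer_list:
        if not getattr(q, "is_enabled", False):
            continue
        num_bits = getattr(q, "num_bits", None)
        block_sizes = getattr(q, "block_sizes", None) or {}
        if num_bits == (2, 1) and block_sizes.get("scale_bits") == (4, 3):
            return "QUANTIZATION_NVFP4"  # placeholder for the real constant
    # Fall back to "none" only after every quantizer has been checked.
    return "QUANTIZATION_NONE"  # placeholder for the real constant
```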
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: f0da2041-19c7-4d5f-b7a8-ecfb8c942950
📒 Files selected for processing (4)
- modelopt/torch/export/model_utils.py
- modelopt/torch/export/quant_utils.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/config.py
CodeRabbit review: expert 0 may be disabled when uncalibrated, so checking only quantizer_list[0] can miss the actual NVFP4 config. Now iterates to find the first enabled quantizer in the list.
…nce check
- Add Qwen3_5MoeForConditionalGeneration to export/import mappings
- Add Qwen3VLModel + HybridModel to GPTModelExporter isinstance check
- Handle GatedDeltaNet layers in _get_transformer_layer_state_dict
- Fix quantizer format detection for disabled quantizers
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/export/unified_export_megatron.py (1)
465-494: ⚠️ Potential issue | 🟠 Major

Guard the new `softmax_offset` export path.

This branch indexes `self.rules["softmax_offset"]` unconditionally once the attribute exists. If the active rule book does not define that key for a supported attention variant, export will raise `KeyError` even though the rest of the path is valid.

🔧 Suggested fix
```diff
-            if hasattr(layer.self_attention, "core_attention") and getattr(layer.self_attention.core_attention, "softmax_offset", None) is not None:
-                self.rules["softmax_offset"](
+            if (
+                "softmax_offset" in self.rules
+                and hasattr(layer.self_attention, "core_attention")
+                and getattr(layer.self_attention.core_attention, "softmax_offset", None) is not None
+            ):
+                self.rules["softmax_offset"](
                     layer.self_attention.core_attention.softmax_offset, layer_id
                 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/unified_export_megatron.py` around lines 465 - 494, The code unconditionally accesses self.rules["softmax_offset"] when layer.self_attention.core_attention has softmax_offset, which can raise KeyError if the active rule book lacks that key; update the final branch that currently checks getattr(layer.self_attention.core_attention, "softmax_offset", None) to also verify "softmax_offset" in self.rules (and optionally that it's callable) before invoking self.rules["softmax_offset"](layer.self_attention.core_attention.softmax_offset, layer_id), keeping the existing hasattr/getattr checks for core_attention/softmax_offset and referencing the symbols self.rules, "softmax_offset", layer.self_attention.core_attention, and layer_id.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 465-494: The code unconditionally accesses
self.rules["softmax_offset"] when layer.self_attention.core_attention has
softmax_offset, which can raise KeyError if the active rule book lacks that key;
update the final branch that currently checks
getattr(layer.self_attention.core_attention, "softmax_offset", None) to also
verify "softmax_offset" in self.rules (and optionally that it's callable) before
invoking
self.rules["softmax_offset"](layer.self_attention.core_attention.softmax_offset,
layer_id), keeping the existing hasattr/getattr checks for
core_attention/softmax_offset and referencing the symbols self.rules,
"softmax_offset", layer.self_attention.core_attention, and layer_id.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 87ec7afa-9907-4194-886c-93fccd2f666d
📒 Files selected for processing (3)
- modelopt/torch/export/plugins/mcore_common.py
- modelopt/torch/export/quant_utils.py
- modelopt/torch/export/unified_export_megatron.py
✅ Files skipped from review due to trivial changes (1)
- modelopt/torch/export/plugins/mcore_common.py
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 465-476: The GatedDeltaNet branch can silently skip applying
required rules and produce incomplete checkpoints; update the handling in
unified_export_megatron.py so that when "GatedDeltaNet" is detected on
layer.self_attention you assert/raise an explicit error if the required rule
keys are missing (e.g., "gated_delta_net_in_proj" and "linear_proj"), and also
raise if out_norm exists and is not IdentityOp but "gated_delta_net_out_norm" is
absent; keep the existing calls to
self.rules["gated_delta_net_in_proj"](layer.self_attention.in_proj, layer_id),
self.rules["gated_delta_net_out_norm"](layer.self_attention.out_norm, layer_id),
and self.rules["linear_proj"](layer.self_attention.out_proj, layer_id) but guard
them with clear exceptions mentioning the missing rule names.
- Line 126: The multimodal detection currently only checks for LLaVAModel,
causing Qwen3VLModel to be treated like a unimodal model; update the two
is_multimodal checks that reference LLaVAModel so they also include Qwen3VLModel
(so Qwen3VLModel is detected as multimodal), ensuring the export path uses
model.language_model where intended and that multimodal components are loaded
(the same checks that gate using model.language_model and loading multimodal
parts should include Qwen3VLModel).
In `@modelopt/torch/quantization/qtensor/nvfp4_tensor.py`:
- Around line 289-295: The empty-tensor branch must preserve the same return
contract as the non-empty path and avoid running reductions on empty inputs:
move the early-return check on input.numel() to before any calls to
reduce_block_amax() or other reductions; when empty, compute num_blocks =
ceil_div(input.shape[-1], BLOCK_SIZE) (same logic used elsewhere) and return the
same tuple structure as the non-empty paths—i.e., if keep_high_precision is True
return the high-precision payload only (matching the branch that returns at line
~310), otherwise return (cls(input_shape, input_dtype, input), a properly-shaped
packed uint8 payload filled with zeros matching the packed layout dequantize()
expects, and a scale tensor shaped (*input.shape[:-1], num_blocks) filled with
zeros); ensure the payload dtype/format matches dequantize()'s bitwise unpacking
(packed uint8 with low/high nibbles), and use the same device/dtype conventions
as reduce_block_amax() and dequantize().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a211fed2-1afc-4bf1-b1a1-bcd8e4e8a759
📒 Files selected for processing (3)
- modelopt/torch/export/plugins/mcore_qwen.py
- modelopt/torch/export/unified_export_megatron.py
- modelopt/torch/quantization/qtensor/nvfp4_tensor.py
```diff
     ):
         """Create a GPTModel exporter instance."""
-        if not isinstance(model, (GPTModel, MambaModel, LLaVAModel)):
+        if not isinstance(model, (GPTModel, MambaModel, HybridModel, LLaVAModel, Qwen3VLModel)):
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Verify accepted model classes vs multimodal routing logic in this exporter.
rg -n -C2 'Qwen3VLModel|isinstance\(model, \(GPTModel, MambaModel, HybridModel, LLaVAModel, Qwen3VLModel\)\)|is_multimodal = isinstance\(model, LLaVAModel\)|self\.model = model\.language_model if self\.is_multimodal else model|load_multimodal_components' modelopt/torch/export/unified_export_megatron.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1923
Qwen3VLModel is accepted but not routed through multimodal export handling.
Line 126 accepts Qwen3VLModel in the isinstance check, but multimodal detection at lines 151 and 157 only checks for LLaVAModel. This causes Qwen3VLModel to skip multimodal export logic: line 158 will use the full model instead of model.language_model, and line 349 will not load multimodal components. The export will be incomplete/incorrect.
Update both is_multimodal checks to include Qwen3VLModel:
Suggested fix
```diff
-        self.is_multimodal = isinstance(model, LLaVAModel)
+        self.is_multimodal = isinstance(model, (LLaVAModel, Qwen3VLModel))
```

Apply this change at both line 151 and line 157.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/export/unified_export_megatron.py` at line 126, The multimodal
detection currently only checks for LLaVAModel, causing Qwen3VLModel to be
treated like a unimodal model; update the two is_multimodal checks that
reference LLaVAModel so they also include Qwen3VLModel (so Qwen3VLModel is
detected as multimodal), ensuring the export path uses model.language_model
where intended and that multimodal components are loaded (the same checks that
gate using model.language_model and loading multimodal parts should include
Qwen3VLModel).
| elif "GatedDeltaNet" in str(type(layer.self_attention)): | ||
| # GatedDeltaNet (linear attention) has in_proj, out_norm, out_proj | ||
| # instead of linear_qkv, q_layernorm, etc. | ||
| # Use dedicated GDN rules if available (no QKV slicing), else skip. | ||
| if "gated_delta_net_in_proj" in self.rules: | ||
| self.rules["gated_delta_net_in_proj"](layer.self_attention.in_proj, layer_id) | ||
| if hasattr(layer.self_attention, "out_norm") and not isinstance( | ||
| layer.self_attention.out_norm, IdentityOp | ||
| ): | ||
| if "gated_delta_net_out_norm" in self.rules: | ||
| self.rules["gated_delta_net_out_norm"](layer.self_attention.out_norm, layer_id) | ||
| self.rules["linear_proj"](layer.self_attention.out_proj, layer_id) |
Fail fast when required GatedDeltaNet rules are missing.
This branch currently skips in_proj / out_norm when rule keys are absent, which can silently emit incomplete checkpoints. Please raise an explicit error instead of skipping.
Suggested fix
elif "GatedDeltaNet" in str(type(layer.self_attention)):
# GatedDeltaNet (linear attention) has in_proj, out_norm, out_proj
# instead of linear_qkv, q_layernorm, etc.
- # Use dedicated GDN rules if available (no QKV slicing), else skip.
- if "gated_delta_net_in_proj" in self.rules:
- self.rules["gated_delta_net_in_proj"](layer.self_attention.in_proj, layer_id)
+ required_rules = ["linear_proj", "gated_delta_net_in_proj"]
+ missing_rules = [r for r in required_rules if r not in self.rules]
+ if missing_rules:
+ raise KeyError(
+ f"Missing required export rule(s) for GatedDeltaNet at layer {layer_id}: {missing_rules}"
+ )
+
+ self.rules["gated_delta_net_in_proj"](layer.self_attention.in_proj, layer_id)
if hasattr(layer.self_attention, "out_norm") and not isinstance(
layer.self_attention.out_norm, IdentityOp
):
- if "gated_delta_net_out_norm" in self.rules:
- self.rules["gated_delta_net_out_norm"](layer.self_attention.out_norm, layer_id)
+ if "gated_delta_net_out_norm" not in self.rules:
+ raise KeyError(
+ f"Missing required export rule 'gated_delta_net_out_norm' for GatedDeltaNet at layer {layer_id}"
+ )
+ self.rules["gated_delta_net_out_norm"](layer.self_attention.out_norm, layer_id)
self.rules["linear_proj"](layer.self_attention.out_proj, layer_id)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/export/unified_export_megatron.py` around lines 465 - 476, The
GatedDeltaNet branch can silently skip applying required rules and produce
incomplete checkpoints; update the handling in unified_export_megatron.py so
that when "GatedDeltaNet" is detected on layer.self_attention you assert/raise
an explicit error if the required rule keys are missing (e.g.,
"gated_delta_net_in_proj" and "linear_proj"), and also raise if out_norm exists
and is not IdentityOp but "gated_delta_net_out_norm" is absent; keep the
existing calls to
self.rules["gated_delta_net_in_proj"](layer.self_attention.in_proj, layer_id),
self.rules["gated_delta_net_out_norm"](layer.self_attention.out_norm, layer_id),
and self.rules["linear_proj"](layer.self_attention.out_proj, layer_id) but guard
them with clear exceptions mentioning the missing rule names.
```python
        # Handle empty tensors (e.g. from TP/EP sharding where this rank has no slice)
        if input.numel() == 0:
            return (
                cls(input_shape, input_dtype, input),
                torch.zeros(*input.shape[:-1], device=input.device, dtype=torch.float8_e4m3fn),
                torch.zeros(1, device=input.device, dtype=torch.float32),
            )
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "== Locate reduce_amax/reduce_block_amax implementations =="
rg -n -C3 'def reduce_amax|def reduce_block_amax' modelopt/torch/quantization
echo
echo "== Inspect NVFP4 quantize callers and whether weights_scaling_factor_2 may be None =="
rg -n -C3 'NVFP4QTensor\.quantize\(' modelopt/torch
echo
echo "== Inspect keep_high_precision call sites/expectations =="
rg -n -C3 'keep_high_precision\s*=\s*True|keep_high_precision=True' modelopt/torch
echo
echo "== Inspect dequantize unpack assumptions (bitwise on quantized payload) =="
rg -n -C3 '_unpack_tensor|>> 4|& 0x0F' modelopt/torch/quantization/qtensor/nvfp4_tensor.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 2895
🏁 Script executed:
```bash
# Get the full quantize method signature and implementation
cat -n modelopt/torch/quantization/qtensor/nvfp4_tensor.py | sed -n '200,350p'
```

Repository: NVIDIA/Model-Optimizer
Length of output: 7218
🏁 Script executed:
```bash
# Search for the quantize method definition and look for keep_high_precision parameter
rg -n 'def quantize' modelopt/torch/quantization/qtensor/nvfp4_tensor.py -A 20
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1081
🏁 Script executed:
```bash
# Check the dequantize implementation to understand what format it expects
rg -n 'def dequantize' modelopt/torch/quantization/qtensor/nvfp4_tensor.py -A 30
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1538
🏁 Script executed:
```bash
# Check if keep_high_precision is used anywhere in the file
rg -n 'keep_high_precision' modelopt/torch/quantization/qtensor/nvfp4_tensor.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 583
🏁 Script executed:
```bash
# Verify the return type contract by checking what the caller expects
rg -n 'NVFP4QTensor.quantize' modelopt/torch/quantization/nn/modules/tensor_quantizer.py -B 5 -A 10
```

Repository: NVIDIA/Model-Optimizer
Length of output: 949
🏁 Script executed:
```bash
# Check if keep_high_precision=True is ever passed to quantize
rg -n 'keep_high_precision\s*=\s*True' modelopt/torch/quantization
```

Repository: NVIDIA/Model-Optimizer
Length of output: 48
🏁 Script executed:
```bash
# Examine get_weights_scaling_factor to understand return shape
rg -n 'def get_weights_scaling_factor' modelopt/torch/quantization/qtensor/nvfp4_tensor.py -A 20
```

Repository: NVIDIA/Model-Optimizer
Length of output: 3723
🏁 Script executed:
```bash
# Get the complete get_weights_scaling_factor method
sed -n '138,173p' modelopt/torch/quantization/qtensor/nvfp4_tensor.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1548
🏁 Script executed:
```bash
# Also check reduce_block_amax to see what shape/dtype it produces
rg -n 'def reduce_block_amax' modelopt/torch/quantization/utils/core_utils.py -A 25
```

Repository: NVIDIA/Model-Optimizer
Length of output: 1198
🏁 Script executed:
```bash
# Check if the keep_high_precision return is ever consumed (any code testing the return type)
rg -n 'scaled_weight' modelopt/torch/quantization/qtensor/nvfp4_tensor.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 256
Empty-tensor fast path breaks NVFP4 quantization contract
The empty-tensor branch at lines 289-295 violates three critical contracts:
1. Type contract violation: When `keep_high_precision=True` (line 228), the method returns a single tensor (line 310), but the empty branch always returns a 3-tuple. This causes a type mismatch if `keep_high_precision=True` is ever passed.
2. Data format contract violation: Line 292 stores the padded float tensor directly as the quantized payload, but `dequantize()` (lines 332-333) uses bitwise unpacking (`>> 4` and `& 0x0F`) that expects packed uint8 data. Passing raw float data will cause unpacking to fail.
3. Scale shape contract violation: The normal path produces per-block scales with shape `(*input.shape[:-1], num_blocks)` via `reduce_block_amax()`. Line 293 returns `torch.zeros(*input.shape[:-1], ...)` with shape `(*input.shape[:-1],)`, omitting the block dimension. This breaks downstream code expecting per-block scales.
Additionally, the empty check at line 290 is placed too late—reductions at lines 251 and 285-286 execute on empty tensors before the early return, potentially producing NaN or 0 values.
Suggested fix (preserve contract for empty tensors)
```diff
+        # Handle empty tensors early (e.g. TP/EP ranks with no slice)
+        if input.numel() == 0:
+            if keep_high_precision:
+                return input
+
+            # Keep quantized contract: packed uint8 payload + per-block scales
+            packed_weight = torch.empty(
+                (*input.shape[:-1], input.shape[-1] // 2),
+                device=input.device,
+                dtype=torch.uint8,
+            )
+            per_block_scale = torch.empty(
+                (*input.shape[:-1], input.shape[-1] // block_size),
+                device=input.device,
+                dtype=torch.float8_e4m3fn,
+            )
+            if weights_scaling_factor_2 is None:
+                weights_scaling_factor_2 = torch.zeros(
+                    1, device=input.device, dtype=torch.float32
+                )
+            return (
+                cls(input_shape, input_dtype, packed_weight),
+                per_block_scale,
+                weights_scaling_factor_2,
+            )
+
         if weights_scaling_factor_2 is None:
             weights_scaling_factor_2 = cls.get_weights_scaling_factor_2(input)
         # try call trtllm fp4 quantization if possible
         if (
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/qtensor/nvfp4_tensor.py` around lines 289 - 295,
The empty-tensor branch must preserve the same return contract as the non-empty
path and avoid running reductions on empty inputs: move the early-return check
on input.numel() to before any calls to reduce_block_amax() or other reductions;
when empty, compute num_blocks = ceil_div(input.shape[-1], BLOCK_SIZE) (same
logic used elsewhere) and return the same tuple structure as the non-empty
paths—i.e., if keep_high_precision is True return the high-precision payload
only (matching the branch that returns at line ~310), otherwise return
(cls(input_shape, input_dtype, input), a properly-shaped packed uint8 payload
filled with zeros matching the packed layout dequantize() expects, and a scale
tensor shaped (*input.shape[:-1], num_blocks) filled with zeros); ensure the
payload dtype/format matches dequantize()'s bitwise unpacking (packed uint8 with
low/high nibbles), and use the same device/dtype conventions as
reduce_block_amax() and dequantize().
With EP=2, local_experts are indexed 0..127 per rank but global IDs must account for EP rank. rank 0 → 0-127, rank 1 → 128-255.
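A small arithmetic sketch of that mapping (function and variable names are illustrative, not the exporter's actual API):

```python
def global_expert_id(ep_rank: int, num_local_experts: int, local_expert_id: int) -> int:
    # Each EP rank owns a contiguous slice of experts; offset by rank * slice size.
    return ep_rank * num_local_experts + local_expert_id

assert global_expert_id(0, 128, 0) == 0       # rank 0 covers experts 0-127
assert global_expert_id(0, 128, 127) == 127
assert global_expert_id(1, 128, 0) == 128     # rank 1 covers experts 128-255
assert global_expert_id(1, 128, 127) == 255
```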
a12351d to f1e2944
Three bugs fixed for multi-rank EP MoE export (Qwen3.6-35B-A3B), plus a save-strategy change:
1. Format string bug: fc1/fc2 prefix has two {} placeholders (layer_id, expert_id).
Using .format(layer_id) fails. Fixed with re.sub to fill only the first {} (see the sketch after this list).
2. Expert offset bug: _grouped_mlp_slicing had no EP rank awareness. Both ranks
wrote experts 0-127 with overlapping keys. Added expert_offset param from
get_expert_model_parallel_rank() * num_local_experts.
3. weight_key bug: used global expert_id for module lookup instead of local_expert_id.
Module has weight0..weight127, not weight128..weight255.
4. Save strategy: all_gather_object causes OOM (pickle overhead on ~40k tensors).
Each rank now writes to separate NFS dir, then rank 0 merges safetensors
shard-by-shard with low memory footprint.
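Sketch of the fix described in item 1, filling only the first `{}` of a two-placeholder prefix; the prefix string below is a made-up example, not the actual mapping entry:

```python
import re

fc1_prefix = "decoder.layers.{}.mlp.experts.{}.linear_fc1"  # hypothetical two-placeholder prefix
layer_id = 3

# str.format(layer_id) raises IndexError because the second {} has no argument,
# so substitute only the first occurrence and leave the expert placeholder intact.
prefix_with_layer = re.sub(r"\{\}", str(layer_id), fc1_prefix, count=1)
print(prefix_with_layer)  # decoder.layers.3.mlp.experts.{}.linear_fc1
```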
… dispatch
- Add GroupedGatedMLPSlicing class to mcore_custom.py for TEGroupedMLP gate/up split
- Add _grouped_gated_mlp_slicing method to GPTModelExporter
- Clone shared-storage tensors before safetensors save (NVFP4 weight_scale broadcast)
- Dispatch fc1 slicing based on mapping func_name for correct expert handling
…n_hidden_size
Both GatedMLPSlicing and GroupedGatedMLPSlicing used module.config.ffn_hidden_size to determine the gate/up split point. For MoE models, ffn_hidden_size is often set to hidden_size (2048) rather than the per-expert intermediate size (512), causing gate_proj to receive the full fused weight and up_proj to be empty [0, N]. Now derives gated_split from the actual weight tensor shape (rows // 2).
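A toy illustration of deriving the split from the fused tensor rather than from `config.ffn_hidden_size`, assuming the gate rows are stacked above the up rows (shapes mirror the 512/2048 example in the message):

```python
import torch

fused_fc1 = torch.randn(1024, 2048)        # 2 * 512 gate/up rows, hidden_size columns
gated_split = fused_fc1.shape[0] // 2      # rows // 2 instead of config.ffn_hidden_size
gate_proj, up_proj = fused_fc1[:gated_split], fused_fc1[gated_split:]
assert gate_proj.shape == up_proj.shape == (512, 2048)
```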
Summary
Four bugs prevent NVFP4 quantization from producing quantized weights for Qwen3.5/3.6 MoE models (and likely other fused MoE architectures using `_QuantFusedExperts`). All four produce silent failures: no errors, just bfloat16 output identical to the input model.

Test Environment

Qwen3.6-35B-A3B (MoE) on NVIDIA DGX Spark (GB10), modelopt 0.45.0 dev, transformers 5.5.4. Output: 20.5 GB NVFP4, down from 66 GB bfloat16.
Bug Details
Bug 1: `is_multimodal_model()` crashes on `None` architectures

File: `modelopt/torch/export/model_utils.py`

Models with `config.architectures = None` (common for fine-tuned checkpoints) crash when `is_multimodal_model()` iterates the list. One-line fix: `or []` fallback.
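A minimal sketch of the fallback pattern; the predicate body is a placeholder, not the actual model_utils.py logic:

```python
def is_multimodal_model(config) -> bool:
    # config.architectures can be None on some fine-tuned checkpoints;
    # "or []" turns that into an empty, safely iterable list.
    architectures = getattr(config, "architectures", None) or []
    return any("VL" in arch for arch in architectures)  # placeholder heuristic for illustration
```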
Bug 2: (Usage issue, not a code bug — fixed in caller)

Bug 3: `get_quantization_format()` does not recognize `_QuantFusedExperts`

File: `modelopt/torch/export/quant_utils.py`

The function iterates `weight_attr_names(module)`, which returns singular attribute names. `_QuantFusedExperts` modules use plural ModuleList quantizers (`gate_up_proj_weight_quantizers.N`), so the function returns `None` and the module is treated as unquantized. Added a pre-check for plural ModuleList quantizers before the singular loop.
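Sketch of the kind of pre-check described, using the plural attribute names from this PR; the helper name and return convention are assumptions:

```python
import torch.nn as nn

_PLURAL_QUANTIZER_ATTRS = ("gate_up_proj_weight_quantizers", "down_proj_weight_quantizers")

def _has_fused_expert_quantizers(module) -> bool:
    # _QuantFusedExperts keeps one quantizer per expert in a ModuleList
    # (gate_up_proj_weight_quantizers.0, .1, ...), which the singular
    # weight_quantizer loop never visits.
    return any(
        isinstance(getattr(module, attr, None), nn.ModuleList)
        for attr in _PLURAL_QUANTIZER_ATTRS
    )
```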
Bug 4: NVFP4 config wildcards do not match plural quantizer names

File: `modelopt/torch/quantization/config.py`

`_nvfp4_selective_quant_cfg()` generates patterns like `*mlp.experts*weight_quantizer` (singular). `_QuantFusedExperts` creates quantizers named `gate_up_proj_weight_quantizers.0` (plural + index). The `fnmatch` fails, the quantizers never receive the NVFP4 config, and 100% stay at the disabled default. Added wildcard entries for both plural suffix patterns.
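A quick check of why the singular wildcard misses the plural names (the fully qualified quantizer name below is an assumed example):

```python
from fnmatch import fnmatch

name = "model.layers.0.mlp.experts.gate_up_proj_weight_quantizers.0"

print(fnmatch(name, "*mlp.experts*weight_quantizer"))      # False: pattern requires a singular suffix
print(fnmatch(name, "*gate_up_proj_weight_quantizers*"))   # True: plural wildcard added by this PR
```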
Bug 5: `_process_quantized_modules` elif order sends fused MoE to wrong export path

File: `modelopt/torch/export/unified_export_hf.py`

Two elif branches: one checks the type name (`"Llama4TextExperts" in type().__name__`), the other checks `hasattr("gate_up_proj_weight_quantizers")`. After `_QuantFusedExperts` wrapping, `QuantQwen3_5MoeExperts` matches the type-name branch, which calls `_export_quantized_weight()` looking for singular attributes → `AttributeError`. Swapped the elif order so the plural-attribute check runs first.
Changes

- `modelopt/torch/export/model_utils.py`: `or []` fallback
- `modelopt/torch/export/quant_utils.py`
- `modelopt/torch/export/unified_export_hf.py`
- `modelopt/torch/quantization/config.py`

Summary by CodeRabbit
Bug Fixes
New Features