Declarative checkpoint config conversion #508

Open

jlamypoirier wants to merge 28 commits into main from jlp_simplify_conversion

Declarative checkpoint config conversion#508
jlamypoirier wants to merge 28 commits into
mainfrom
jlp_simplify_conversion

Conversation

Collaborator

@jlamypoirier jlamypoirier commented May 5, 2026

Summary

Replaces the imperative import_config / export_config methods on checkpoint converter classes with declarative ConfigConverter primitives keyed off FieldHint.architecture. Two recursive coverage walkers — one over Fast-LLM architecture fields, one over the HF dict — guarantee that every architecture-significant field and every HF dict key is accounted for on import and export. Unsupported feature detection is driven from a single source instead of scattered per-converter asserts.

What's in

  • Primitives (fast_llm/engine/checkpoint/external.py): Rename, ConstantImport, ConstantExport, Optional, Default, Ignored, ImportOnly, Custom, plus the recursive Nested / Dispatch / TypedDictContainer. Each declares the Fast-LLM paths and HF paths it consumes so the walkers can verify coverage on both sides (a minimal stand-in sketch follows this list).
  • ConfigSectionConverter ABC: cached _create_config_converters per class (subclasses extend the parent's dict by re-declaring keys), _validate_export hook for format-specific cross-field invariants, and recursive check_architecture_coverage / check_hf_coverage.
  • Architecture-hint reclassification (eight fields): attention dense_layer and softmax_scale_power, MLP activation, MoE router, Llama3 / Yarn rotary scaling parameters, StochasticMixer.main_mixer_name, vision patch_height / patch_width, and PatternBlockSequenceConfig.blocks. These are what the coverage walkers now require to be consumed.
  • GPT-side migration: llama, mistral, qwen2 (including the MRoPE guard declaratively as an ImportOnlyConfigConverter), mtp_llama, mixtral, diffusion_dream, diffusion_llama, apriel2 (text), the apriel hybrid-SSM mixers (Mamba, GatedDeltaNet, KimiDeltaAttention), and gemma4 — the latter as a declarative shell over imperative cross-block helpers (Gemma 4's HF format merges sliding + full attention into one config and cross-references hidden_size).
  • Multimodal migration: Llava (vision adapter, vision model, base model) and multimodal Apriel2 (vision attention/block/MLP/encoder/embeddings/adapter/model and the top-level multimodal base) are now ConfigSectionConverters. The Llava adapter is declared at the base-model scope so its intermediate_size import can cross-reference text_config.hidden_size. Apriel2 vision injects patch_size / max_image_size rotary metadata at the vision-model scope (the smallest scope that sees both embeddings.patch_height and the encoder rotary subtree).
  • Decoder dispatcher flattening: Mistral / Qwen2 / Mixtral DecoderConverter subclasses deleted. LlamaBaseModelConverter inlines the Fixed/Pattern dispatch parameterised by block_converter_class. The imperative LlamaDecoderConverter helper class survives because Pixtral's vision encoder dispatch and Apriel's per-position hybrid_block_layout dispatch don't fit the inlined common case; AprielBaseModelConverter overrides the "decoder" declaration to delegate to AprielDecoderConverter via a new apriel_decoder_converter_class ClassVar.
  • HF coverage allowlist propagation: check_hf_coverage now matches the allowlist against any segment of the walked path (not just top-level keys). transformers' generic PretrainedConfig metadata (architectures, torch_dtype, transformers_version, …) is now accepted at any depth — relevant for Llava's vision_config and similar nested sub-configs that transformers auto-populates on round-trip.
  • Static test (tests/models/test_converters.py): walks every registered handler's converter tree, runs check_architecture_coverage on each section node, and validates that every OptionalConfigConverter sentinel matches the field's resolved default (so an exported sentinel-equal value can't silently drift on re-import).
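
The declarative shape can be pictured roughly as follows. This is a minimal, self-contained sketch: the class names mirror the PR, but the signatures, dict-based configs, and field names are simplified assumptions, not the real fast_llm API.

```python
# Illustrative stand-ins only; not the real fast_llm converter classes.
import dataclasses


@dataclasses.dataclass
class RenameConfigConverter:
    """1:1 rename between a Fast-LLM field and an HF key."""
    fast_llm_name: str
    hf_name: str

    def export_to(self, fast_llm_config: dict, hf_config: dict) -> None:
        hf_config[self.hf_name] = fast_llm_config[self.fast_llm_name]

    def import_to(self, hf_config: dict, fast_llm_config: dict) -> None:
        fast_llm_config[self.fast_llm_name] = hf_config[self.hf_name]


@dataclasses.dataclass
class ConstantImportConfigConverter:
    """Assert the Fast-LLM default on export, inject it on import; writes no HF key."""
    fast_llm_name: str
    value: object

    def export_to(self, fast_llm_config: dict, hf_config: dict) -> None:
        assert fast_llm_config[self.fast_llm_name] == self.value

    def import_to(self, hf_config: dict, fast_llm_config: dict) -> None:
        fast_llm_config[self.fast_llm_name] = self.value


class AttentionSectionConverter:
    """Hypothetical section converter: a dict of declarations replaces the
    imperative import_config/export_config bodies."""

    @classmethod
    def _create_config_converters(cls) -> dict:
        return {
            "heads": RenameConfigConverter("heads", "num_attention_heads"),
            "head_groups": RenameConfigConverter("head_groups", "num_key_value_heads"),
            # Architecture field with no HF counterpart: pinned to the default.
            "softmax_scale_power": ConstantImportConfigConverter("softmax_scale_power", 0.5),
        }

    @classmethod
    def export_config(cls, fast_llm_config: dict) -> dict:
        hf_config: dict = {}
        for converter in cls._create_config_converters().values():
            converter.export_to(fast_llm_config, hf_config)
        return hf_config

    @classmethod
    def import_config(cls, hf_config: dict) -> dict:
        fast_llm_config: dict = {}
        for converter in cls._create_config_converters().values():
            converter.import_to(hf_config, fast_llm_config)
        return fast_llm_config


exported = AttentionSectionConverter.export_config(
    {"heads": 32, "head_groups": 8, "softmax_scale_power": 0.5}
)
assert exported == {"num_attention_heads": 32, "num_key_value_heads": 8}
assert AttentionSectionConverter.import_config(exported)["softmax_scale_power"] == 0.5
```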

Main merge

  • Brings in the new architecture fields landed since the branch cut: AttentionConfig.{query,key,value}_norm + shared_key_value, MLPConfig.{pre,post}_norm, DecoderBlockConfig.{pre,post}_{mixer,mlp}_normalization + output_scale, LanguageModelEmbeddingsConfig.embedding_scale, LanguageModelHeadConfig.final_logit_softcap, and MoEMLPConfig.{router_normalization, router_scale, router_input_scale, router_per_expert_scale}. Each new field is claimed on every section converter whose fast_llm_config_class is the field's container via ConstantImportConfigConverter (asserts the Fast-LLM default on export, injects the default on import); OptionalParameterConfig fields (output_scale, router_scale, router_per_expert_scale) are claimed via IgnoredConfigConverter plus a _validate_export "not enabled" assertion to preserve the imperative reject-non-default behaviour (see the sketch below).
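
A hedged sketch of that claim-and-reject pattern for OptionalParameterConfig fields; the config and converter classes below are toy stand-ins, not the real fast_llm ones.

```python
import dataclasses


@dataclasses.dataclass
class OptionalParameter:
    enabled: bool | None = None  # stand-in for OptionalParameterConfig


@dataclasses.dataclass
class BlockConfig:
    output_scale: OptionalParameter = dataclasses.field(default_factory=OptionalParameter)


class BlockSectionConverter:
    @classmethod
    def _validate_export(cls, config: BlockConfig) -> None:
        # The field is claimed by an IgnoredConfigConverter so the coverage walker
        # is satisfied, but a configured value still fails loudly on export instead
        # of being silently dropped from the HF dict.
        assert not config.output_scale.enabled, "output_scale not representable in this HF format"


BlockSectionConverter._validate_export(BlockConfig())  # default (not enabled): passes
```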

Notable shape decisions

  • Coverage check is type-strict (type(config) is cls.fast_llm_config_class). Strict subclasses defer to a more specific converter, which kept partially migrated callers working through super() during the multi-step migration.
  • HF-side coverage walker consumes prefixes registered by every declaration, descending through Nested / Dispatch sub-converters under their hf_path. IgnoredConfigConverter(hf_paths=...) is the explicit opt-out for HF-only fields with no Fast-LLM counterpart (Mixtral router-runtime toggles, Qwen2 sliding-window, …) and for the model-specific defaults transformers' Pixtral/Llava fill on round-trip (head_dim, image_size, layer_norm_eps, projection_dim, vocab_size, image_seq_length, tie_word_embeddings at the Llava level).
  • _create_config_converters is cached with @functools.cache. Subclasses must return a fresh dict ({**super()._create_config_converters(), ...}) — mutating the parent's returned dict would corrupt its cache entry (illustrated in the sketch after this list). Documented on the base method.
  • IgnoredConfigConverter is recursive (recurses=True). Used for sub-configs with no architecture leaves (ParameterConfig, Mixtral's router sub-config) and for HF-only fields. Non-architecture fields (lr_scale, apply_peft, initialization sub-config) are by design not part of the HF round-trip; Fast-LLM keeps them on the in-memory config independently.
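
A self-contained illustration of that cache hazard, using toy classes with stand-in values rather than real converter declarations:

```python
import functools


class Base:
    @classmethod
    @functools.cache
    def _create_config_converters(cls) -> dict:
        # Cached: every call for the same cls returns the *same* dict object.
        return {"heads": "rename", "activation": "constant"}


class Child(Base):
    @classmethod
    @functools.cache
    def _create_config_converters(cls) -> dict:
        # Correct: spread into a fresh dict; never mutate the object super() returns.
        return {**super()._create_config_converters(), "activation": "rename"}


# The cached dict is a shared object: in-place mutation by any caller leaks into
# every later call for the same class, which is why subclasses must spread, not mutate.
shared = Base._create_config_converters()
shared["activation"] = "mutated"
assert Base._create_config_converters()["activation"] == "mutated"
assert Child._create_config_converters()["activation"] == "rename"
```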

What's deferred

  • Weight-converter declarative refactor. The weight side still uses today's WeightConverter subclasses (SplitWeightConverter, MLPLayer2Converter, KeyValueWeightConverter, …) and the per-converter get_converters classmethods.
  • Gemma 4 / Apriel block-level declarative conversion. Gemma 4's Gemma4BlockConverter / Gemma4DecoderConverter and Apriel's AprielBlockConverter / AprielDecoderConverter remain imperative — Gemma 4 because the HF format merges sliding + full attention into one config and cross-references hidden_size (doesn't fit per-section decomposition), Apriel because hybrid_block_layout is a positional list discriminator (would need a new ListDispatchConfigConverter primitive).
  • A handful of structural follow-ups recorded for future review rounds: IgnoredConfigConverter as a _ignored_fields ClassVar, get_converters signature uniformity, LlamaDecoderConverter as a ConfigSectionConverter, IgnoredConfigConverter default-round-trip maintenance test.

Test plan

  • pytest -v -n 8 tests/models/test_checkpoint.py tests/models/test_hf_roundtrip.py tests/models/test_converters.py: 271 passed, no failures (gemma4 dependency group now included).
  • pytest -v -n 8 fast_llm_external_models/tests/: 2109 passed, 42 skipped (separate invocation per CLAUDE.md).
  • Manual smoke: fast-llm convert --input.format <fmt> --input.path <ref> --output.format <fmt> --output.path <tmp>; reload both and compare configs.

🤖 Generated with Claude Code

jlamypoirier and others added 3 commits May 5, 2026 18:33
Eight config fields whose values directly affect model architecture were
previously hinted as feature/core/(none) and are now tagged
FieldHint.architecture. They drive the upcoming declarative-converter
coverage check, which uses FieldHint.architecture as the source of truth
for "must be handled by every checkpoint format". (A toy sketch of the
tagging follows the field list below.)

- AttentionConfig.dense_layer (output projection presence)
- AttentionConfig.softmax_scale_power (attention scaling)
- MLPConfig.activation (forward-pass activation type)
- MoEMLPConfig.router (routing weights drive token assignment)
- Llama3RotaryConfig: scale_factor, low_frequency_factor,
  high_frequency_factor, original_context_length
- YarnRotaryConfig: scale_factor, attention_factor, beta_fast, beta_slow,
  original_context_length
- StochasticMixerConfig.main_mixer_name (selects inference mixer)
- PatchEmbeddingsConfig.patch_height/patch_width (input tokenization)
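
A toy sketch of what the tagging buys the coverage check. Field and FieldHint here are simplified stand-ins, not the real fast_llm classes; only the idea of architecture-hinted leaves is taken from the commit.

```python
import enum
from dataclasses import dataclass, field, fields


class FieldHint(enum.Enum):
    architecture = enum.auto()  # must be claimed by every checkpoint format
    feature = enum.auto()


@dataclass
class AttentionConfig:
    heads: int = field(default=32, metadata={"hint": FieldHint.architecture})
    softmax_scale_power: float = field(default=0.5, metadata={"hint": FieldHint.architecture})
    dropout: float = field(default=0.0, metadata={"hint": FieldHint.feature})


def architecture_fields(config_cls: type) -> set[str]:
    """The set of field names a section converter must account for."""
    return {
        f.name
        for f in fields(config_cls)
        if f.metadata.get("hint") is FieldHint.architecture
    }


assert architecture_fields(AttentionConfig) == {"heads", "softmax_scale_power"}
```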

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reintroduces the declarative config-conversion shape that pre-dated PR #362,
applied within the post-#362 modular per-section structure. Replaces the
imperative import_config/export_config bodies with a small set of named
primitives and a recursive walker driven by per-section declarations.

Primitives in fast_llm.engine.checkpoint.external:
- RenameConfigConverter — 1:1 path rename
- ConstantExportConfigConverter — write constant on export, assert on import
- ConstantImportConfigConverter — assert on export, inject on import
- DefaultConfigConverter — rename with HF-side fallback
- OptionalConfigConverter — emit/import only when non-sentinel
- IgnoredConfigConverter — declare a field as intentionally not converted
- CustomConfigConverter — escape hatch for cross-field transforms
- NestedConfigConverter — recurse into a fixed-typed sub-config; flat-merges
  HF output into the parent (transformer side is assumed flat)
- DispatchConfigConverter — runtime type dispatch for polymorphic sub-configs

ConfigSectionConverter is the per-Fast-LLM-class converter base. Subclasses
declare their conversion via _create_config_converters() and inherit
import_config/export_config concretely. The architecture-coverage check fires
only when type(config) exactly matches the converter's declared
fast_llm_config_class — strict subclass types defer to a more specific
converter, allowing yet-to-be-migrated subclasses (e.g., Mixtral on Llama)
to call super().export_config() without tripping the parent's check on
fields the parent doesn't know about.

The walker is implicit: NestedConfigConverter / DispatchConfigConverter
call the public import_config/export_config on the sub-converter class so
subclass overrides participate, rather than a private path that bypasses
them.
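
A toy sketch of that strict-type gate (stand-in classes; the real ConfigSectionConverter lives in fast_llm.engine.checkpoint.external):

```python
class MLPConfig:
    pass


class MoEMLPConfig(MLPConfig):
    pass


class MLPSectionConverter:
    fast_llm_config_class = MLPConfig

    @classmethod
    def export_config(cls, config) -> dict:
        # The coverage check fires only on an exact type match: a MoEMLPConfig
        # passed through super().export_config() defers to the more specific
        # (possibly not-yet-migrated) converter instead of failing on MoE-only
        # fields this converter doesn't know about.
        if type(config) is cls.fast_llm_config_class:
            cls._check_architecture_coverage(config)
        return {}

    @classmethod
    def _check_architecture_coverage(cls, config) -> None:
        pass  # walks architecture-hinted fields against the declarations


MLPSectionConverter.export_config(MLPConfig())     # exact match: checked
MLPSectionConverter.export_config(MoEMLPConfig())  # strict subclass: deferred
```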

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pilot of the new ConfigSectionConverter framework. Each Llama section
converter (Normalization/MLP/Attention/Block/Embeddings/Head/BaseModel)
now declares its conversion via _create_config_converters() instead of
imperative import_config/export_config bodies. Weight side is unchanged.

Notable shape decisions:
- LlamaDecoderConverter stays as a regular (imperative) class because
  Fixed/Pattern block-sequence dispatch doesn't lend itself to the
  declarative shape. LlamaBaseModelConverter wires it in via a small
  CustomConfigConverter; subclasses (Mistral, Qwen2, MTP-Llama, ...)
  continue to plug in different block converters via block_converter_class.
- _check_config is retained as an overridable classmethod and called from
  the linear_layers CustomConfigConverter, so Qwen2 can keep its
  asymmetric Q/K/V bias rule without re-implementing the export.
- IgnoredConfigConverter is used for ParameterConfig sub-fields with no
  architecture-significant content (weight, output_weight, word_embeddings),
  and for prediction_heads (which Llama HF doesn't expose; subclass
  MTP-Llama adds it imperatively).
- peft uses CustomConfigConverter to assert NoPeftConfig on export. Llama
  HF format cannot represent PEFT, so a configured LoRA now fails loudly
  rather than being silently dropped.
- Rotary remains in CustomConfigConverter — the v4/v5 transformers split
  (rope_theta/rope_scaling vs. rope_parameters) and three rope_type
  variants don't fit pure rename primitives.

Verified with live round-trips of Llama-3, Qwen2, Mistral, Mixtral, and
MTP-Llama HF configs, plus tests/models/test_checkpoint.py for all GPT
formats (139 passed, 0 failed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jlamypoirier jlamypoirier force-pushed the jlp_simplify_conversion branch from 5567a71 to 0c406db on May 5, 2026 22:33
jlamypoirier and others added 16 commits May 6, 2026 07:14
Adds `_validate_export(cls, config)` classmethod hook on `ConfigSectionConverter`,
called automatically from `export_config` after the architecture-coverage check.
Replaces five `CustomConfigConverter`-as-validator blocks (`linear_layers`/`layers`
in attention and MLP, `position_embeddings` in embeddings, `peft` in base model,
plus the `_check_config` chain on attention) with `IgnoredConfigConverter` for
field-claiming + small `_validate_export` overrides. Mistral and Qwen2 rename
their `_check_config` overrides accordingly; Pixtral's imperative export updates
its `cls._check_config(config)` call site.

Also addresses several reviewer-flagged correctness/cleanup items:

- Drop the half-removed `parent_context` parameter from every primitive's
  `import_to` signature (and from `CustomConfigConverter`'s `import_fn`). It was
  unreachable through the walker.
- `_check_architecture_coverage` now reads `cls.fast_llm_config_class` directly
  instead of `getattr(..., None)`, surfacing missing class-attribute declarations
  as `AttributeError` rather than silently disabling the safety net.
- Drop the unused `hf_paths` parameter from `CustomConfigConverter.__init__`. There
  is no symmetric HF-side coverage check yet, so the field was cosmetic.
- Add a TODO note in `_check_architecture_coverage` documenting that the
  `MoEMLPConfig`/`MambaConfig`/etc. safety net is gated on later migrations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dict of named per-block configs is unambiguously architecture
metadata; without an explicit hint it defaulted to `unknown`, hiding
it from the architecture-coverage check used by declarative checkpoint
converters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two additions, both required by Apriel2's nested HF schema:

- `NestedConfigConverter` gains an optional `hf_path` kwarg. When set,
  the sub-converter's output is placed under that nested key instead
  of being flat-merged. Existing flat-merge behavior is unchanged when
  `hf_path` is omitted.
- New `TypedDictContainerConfigConverter` for `dict[str, Config]`
  fields where each entry is round-tripped through a per-class
  section converter. Polymorphic dispatch via the entry's runtime
  type on export and the HF discriminator on import. A homogeneous
  mode (single registered class with `hf_type_name = None`) skips
  the discriminator entirely.

Both `DispatchConfigConverter` and `TypedDictContainerConfigConverter`
now also inject the Fast-LLM `dynamic_type_name` discriminator into
the imported sub-dict so the parent's `from_dict` dispatches to the
right `Config` subclass without a separate ConstantImport.
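
A toy sketch of the typed-dict export path described above, with simplified configs and per-class converters; the real TypedDictContainerConfigConverter API is assumed, not copied.

```python
import dataclasses


@dataclasses.dataclass
class AttentionConfig:
    heads: int


@dataclasses.dataclass
class MambaConfig:
    state_size: int


class AttentionMixerConverter:
    fast_llm_config_class = AttentionConfig
    hf_type_name = "attention"

    @classmethod
    def export_config(cls, config: AttentionConfig) -> dict:
        return {"heads": config.heads}


class MambaMixerConverter:
    fast_llm_config_class = MambaConfig
    hf_type_name = "mamba"

    @classmethod
    def export_config(cls, config: MambaConfig) -> dict:
        return {"state_size": config.state_size}


MIXER_CONVERTERS = [AttentionMixerConverter, MambaMixerConverter]


def export_mixer_dict(mixers: dict[str, object]) -> dict[str, dict]:
    # Dispatch on each entry's runtime type; inject the HF "type" discriminator
    # so import can dispatch back to the matching converter.
    out = {}
    for name, mixer in mixers.items():
        converter = next(
            c for c in MIXER_CONVERTERS if type(mixer) is c.fast_llm_config_class
        )
        out[name] = {"type": converter.hf_type_name, **converter.export_config(mixer)}
    return out


print(export_mixer_dict({"attn": AttentionConfig(heads=16), "ssm": MambaConfig(state_size=64)}))
```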

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stress-tests the framework's polymorphic dispatch and typed-dict
support: Apriel2's HF schema is nested (`decoder.block.mixer.{...}`,
`head.normalization`, `mixers.{name}`) and the mixer field is
heterogeneously polymorphic (Attention/Mamba/StochasticMixer/GDN/KDA).

Migrated converters: per-mixer (Attention/Mamba/GDN/KDA), the
StochasticMixer container (driven by TypedDictContainer over a
leaf-mixer registry), per-normalization (RMS/LayerNorm/NoNorm), MLP,
Block, Fixed/Pattern decoder variants (selected by Dispatch on
runtime BlockSequenceConfig type), Head, and BaseModel.

The imperative weight-side `get_converters` methods are preserved
unchanged so the multimodal Apriel2 converter (which inherits from
the text-only one) keeps working without modification.

PatternDecoder's `blocks` dict uses the homogeneous mode of
TypedDictContainer (single-class registry, no discriminator). The
attention rotary-type translation (default ↔ mistral_1d) and Mamba's
auxiliary HF fields (d_conv, conv_bias, dt_proj_bias derived from
linear-config bias flags) remain on `CustomConfigConverter` since
they're shape-changing transforms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…primitives

Each format inherits Llama's `_create_config_converters` and replaces only the
fields that diverge:
  * Mistral: ConstantImportConfigConverter pinning `add_linear_biases=False` for
    attention and MLP (HF format has no `attention_bias`/`mlp_bias`); rename
    `window_size` <-> `sliding_window`.
  * Qwen2: ConstantImportConfigConverter for `add_linear_biases`; CustomConfigConverter
    for `head_size` (no `head_dim` HF field, derive on import); CustomConfigConverter
    for per-layer biases (always Q/K/V=True, dense=False); the head_dim relationship
    `heads * head_size == hidden_size` moves to `_validate_export` on the base-model
    converter; the use_mrope guard moves to `import_config`.
  * MTP-Llama: RenameConfigConverter for `prediction_heads` (Llama blanket-ignores it).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`MixtralMLPConverter` switches its `fast_llm_config_class` to `MoEMLPConfig` so the
architecture-coverage check sees MoE-specific fields. The config-side overrides:
  * `add_linear_biases` -> ConstantImportConfigConverter (Mixtral has no `mlp_bias`).
  * `experts` <-> `num_local_experts` and `experts_per_token` <-> `num_experts_per_tok`
    via RenameConfigConverter.
  * `shared_experts=0` and `routing=topk` pinned via ConstantImportConfigConverter so
    they round-trip cleanly without an HF representation.
  * `router` covered by IgnoredConfigConverter (Mixtral's gate is a default `LinearConfig`).
The Fast-LLM dynamic-type discriminator (`type: "moe"`) is injected via an `import_config`
override since the MLP is wrapped via `NestedConfigConverter` rather than `DispatchConfigConverter`.

Diffusion-Dream and Diffusion-Llama need no migration: they only override `architecture`,
`get_transformers_configuration_class`, and `_export_config` (auto_map). They inherit the
declarative converters from their parents (Qwen2 and Llama).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…itives

`AprielMambaConverter`, `GatedDeltaNetConverter`, and `KimiDeltaAttentionConverter` become
`ConfigSectionConverter` subclasses with their HF-side fields nested under the appropriate
HF subkey (`ssm_cfg` for Mamba, `linear_attn_config` for GDN/KDA).

Mamba's three sibling-default fields (`d_inner`, `d_xb`, `dt_rank`) read the HF root's
`hidden_size` directly via `DefaultConfigConverter.hf_default_fn` / `CustomConfigConverter`,
removing the need for an explicit `parent_context` plumbing through the framework. The
per-layer convolution and dt biases use `CustomConfigConverter` to pick up the mixer-wide
`add_linear_biases` fallback when unset; the existing `_check_config` per-layer assertions
move to `_validate_export`.

`AprielBlockConverter` (the per-block dispatcher) and `AprielDecoderConverter` (the
`hybrid_block_layout` driver) stay imperative because Apriel's HF format encodes the
mixer type in a parent-level list rather than a per-block discriminator, which
`DispatchConfigConverter` doesn't model. The `type: "mamba"`/`"gdn"`/`"kda"` Fast-LLM
discriminator is injected via a one-line `import_config` override on each leaf converter
(same pattern Mixtral uses).

The HF format has no test coverage in `tests/models/test_checkpoint.py` or
`tests/models/test_hf_roundtrip.py`, so verification was a synthesized live round-trip
covering each mixer leaf plus a hybrid attention+Mamba pattern decoder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…larative primitives

`PixtralNormalizationConverter` collapses to a single `_create_config_converters` override
that pins `epsilon=1e-5` via `ConstantImportConfigConverter` (asserts on export, injects
on import; no HF write). `PixtralEmbeddingsConverter` becomes a `ConfigSectionConverter`
with declarations for `patch_height` (rename to `patch_size`), `patch_width` (mirror
`patch_size` on import), `num_channels` (export-only constant 3), nested `normalization`,
and an `IgnoredConfigConverter` for `patch_embeddings`. The `patch_height == patch_width`
and `patch_embeddings.bias.enabled in (None, False)` checks move to `_validate_export`.

The remaining Llava and Apriel2 multimodal converters stay imperative: they're cross-section
aggregators (vision_config + text_config + top-level merge) whose shape doesn't fit a single
ConfigSectionConverter, often with parent-context dependencies (e.g., the adapter's
intermediate_size derives from the text model's hidden_size).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`CopyWeightConverter` was defined in `external.py` but never instantiated; deleted.
`QueryWeightConverter` was a no-op identity (its `export_weight`/`import_weight` just
unwrap and rewrap); replaced with the default `WeightConverter` at all three call
sites (Llama, Qwen2, Apriel2 attention) and removed the redundant `config` arg.

The broader weight-side refactor (declarative `WeightConverter` primitives, walker-driven
`drop_on_export` removal) is deferred — out of scope for this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Fix asymmetric round-trip in `Apriel2MambaConverter`: the `aux` declaration's import_fn
  now reads `d_conv` / `conv_bias` / `dt_proj_bias` back into `convolution_layer.kernel_size`,
  `convolution_layer.bias.enabled`, and `dt_layer.bias.enabled`. Previously these HF fields
  were dropped on import, which silently masked HF conv1d/dt_proj bias weights when they
  diverged from the mixer-wide `add_linear_biases` flag (parallel to the apriel.py mamba
  migration earlier in this PR).
- Drop the stale TODO from `_check_architecture_coverage`'s docstring (the migrations it
  referred to have all landed in this PR); reword the surrounding comment to describe
  the current strict-subtype handling.
- Combine adjacent f-strings in `DispatchConfigConverter`'s import-error message.
- Hoist `StochasticMixerSamplingStrategy` to the module-level import in `apriel2.py`;
  it was being re-imported on every `_create_config_converters` call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Recursive architecture-coverage walker (item 1): the section-level check now
  collects every architecture-hint path under the active config tree and matches
  each against the declarations. Recursive primitives
  (Nested/Dispatch/TypedDictContainer/Ignored, plus Custom/ImportOnly when the
  author opts in) cover whole subtrees by prefix; non-recursive ones must list
  every leaf they consume. Fixes the silent-drop class of bug previously masked
  for any sub-config field claimed by a flat CustomConfigConverter.

- Apriel2 rotary export bug fix (motivating leak for item 1): the export now
  emits the Llama3/Yarn scale parameters that round-trip via the pass-through
  import, instead of silently dropping them.

- Pixtral attention migrated to declarative form (item 3): _create_config_converters
  overrides instead of an imperative export_config that bypassed the coverage
  check.

- Apriel2 weight side cleanup (items 5, 6, 12): Apriel2MLPConverter owns its
  weight converters and the block delegates; the imperative Apriel2DecoderConverter
  is gone, replaced by per-shape get_converters on
  Apriel2FixedDecoderConverter / Apriel2PatternDecoderConverter dispatched via
  APRIEL2_DECODER_REGISTRY.

- ImportOnlyConfigConverter primitive (item 11) collapses three asymmetric
  CustomConfigConverter sites in qwen2.py and llava.py.

- Helper consolidation: drop external.py's _get_nested/_has_nested in favour of
  fast_llm.config.get_nested_dict_value (item 7); share assert_no_peft between
  Llama and Apriel2 base-model converters (item 10).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Subtree drops are now visible at the declaration site (review item 1).
  Five Custom converters previously claimed a sub-config subtree via
  ``recurses=True`` while only round-tripping a fraction of its architecture
  leaves; each is now non-recursive (lists every leaf it actually round-trips)
  with sibling ``IgnoredConfigConverter`` entries for the leaves the format
  drops on purpose. Sites: Apriel mamba ``convolution_layer`` and ``dt_layer``,
  Apriel2 GDN ``convolution_layer``, Apriel2 KDA ``convolution_layer`` and
  ``normalization``.

- Architecture-coverage walker now descends into ``dict[str, Config]`` and
  list/tuple-of-Config fields (item 2). Previously masked by
  ``TypedDictContainerConfigConverter.recurses=True``; the walker now matches
  what the docstring claims.

- Coverage error gains a hint when missing paths share a top-level prefix that
  is claimed non-recursively (item 3 — message half only): suggests
  Nested/Dispatch or ``recurses=True`` on Custom/ImportOnly. No new ``recurses``
  kwarg on the base primitives.

- Single ``effective_bias(layer_config, default)`` helper in llama.py replaces
  three near-duplicates (item 4): ``_resolve_bias_enabled`` in apriel.py,
  ``_get_effective_bias`` in apriel2.py, and the inline ternary in
  ``Apriel2MLPConverter`` (a hedged reconstruction follows this list).

- Apriel2 decoder dispatch lookup lifted into module-level
  ``get_apriel2_decoder_converter(decoder)`` (item 6); used by both the text
  and multimodal base-model converters.
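
A hedged reconstruction of the consolidated helper: the signature comes from the commit message, while the body and the config stand-ins are assumptions about the shared "per-layer bias flag with a mixer-wide fallback" logic.

```python
import dataclasses


@dataclasses.dataclass
class BiasConfig:
    enabled: bool | None = None  # None means "inherit the surrounding default"


@dataclasses.dataclass
class AffineLinearConfig:
    bias: BiasConfig = dataclasses.field(default_factory=BiasConfig)


def effective_bias(layer_config: AffineLinearConfig, default: bool) -> bool:
    """Resolve a per-layer bias flag, falling back to the mixer-wide default."""
    return default if layer_config.bias.enabled is None else layer_config.bias.enabled


assert effective_bias(AffineLinearConfig(), default=True) is True
assert effective_bias(AffineLinearConfig(BiasConfig(enabled=False)), default=True) is False
```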

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Apriel2 decoder converters expose `block_converter_class` ClassVar so subclasses can swap the
  block converter, mirroring the LlamaDecoderConverter polymorphism pattern.
* `_create_config_converters` is memoized via `functools.cache` (keyed by cls), so per-class
  declarations are built once. Convert two `out = super(); out[k] = v` mutation patterns
  (qwen2, llava) to spread+new-dict so the cached parent dict is never mutated.
* `NestedConfigConverter` auto-injects the HF `type` discriminator from the target
  converter's `hf_type_name`, mirroring `DispatchConfigConverter`/`TypedDictContainer`.
  Drops a manual `ConstantExportConfigConverter` from `Apriel2MLPConverter`.
* Move architecture-coverage check to `tests/models/test_converters.py`, parametrized
  per-format. Walks each `HuggingfaceStateDictCheckpointHandler.base_model_converter_class`
  through the modular converter tree (Nested/Dispatch/TypedDict + `*_converter_class` ClassVars)
  and runs `check_architecture_coverage` on each `ConfigSectionConverter` node. The
  per-export runtime invocation is removed.
* Same test verifies `OptionalConfigConverter` sentinels match the resolved field default —
  catches silent round-trip drift if a Fast-LLM default changes (illustrated in the sketch
  below).
* Two latent bugs surfaced and fixed by the new test:
  * `apriel.py` GDN/KDA converters were missing `convolution_layer` architecture claims.
  * `Apriel2MambaConverter.d_xb`/`dt_rank` misused `OptionalConfigConverter`
    (sentinel=None on a non-Optional int) - converted to `RenameConfigConverter`.

Deferred to follow-up commit: HF-side coverage check on every import (item 10) - needs
`hf_paths` audit across ~20 Custom/ImportOnly call sites and a flat-merge-aware walker.
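
A self-contained illustration of the sentinel/default invariant the test enforces; the converter below is a toy stand-in for OptionalConfigConverter.

```python
import dataclasses


@dataclasses.dataclass
class OptionalConfigConverter:
    fast_llm_name: str
    hf_name: str
    sentinel: object  # value meaning "don't emit an HF key"

    def export_to(self, fast_llm_config: dict, hf_config: dict) -> None:
        value = fast_llm_config[self.fast_llm_name]
        if value != self.sentinel:
            hf_config[self.hf_name] = value

    def import_to(self, hf_config: dict, fast_llm_config: dict) -> None:
        fast_llm_config[self.fast_llm_name] = hf_config.get(self.hf_name, self.sentinel)


FIELD_DEFAULT = None
converter = OptionalConfigConverter("window_size", "sliding_window", sentinel=FIELD_DEFAULT)

# A sentinel-equal value is omitted on export, and import falls back to the
# sentinel. The round trip is only lossless while sentinel == field default;
# the static test asserts exactly that for every OptionalConfigConverter.
hf: dict = {}
converter.export_to({"window_size": FIELD_DEFAULT}, hf)
restored: dict = {}
converter.import_to(hf, restored)
assert restored["window_size"] == FIELD_DEFAULT
```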

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Symmetric counterpart to the architecture-coverage check (already a test). Walks the HF
config dict at the import boundary and raises on any key not consumed by some declaration
in the converter tree. Catches transformers-version drift, manual edits, and corrupted
configs at the point of import rather than as cryptic downstream failures. (A toy sketch
of the walk follows the list below.)

* ``ConfigConverter`` primitives gain a recursive ``_consumed_hf_paths`` walker. Nested/
  Dispatch/TypedDictContainer with a fixed ``hf_path`` claim it as a subtree prefix; their
  flat-merge variants (``hf_path=None``) pull the sub-converter's claims up to the current
  level so a parent's check sees them.
* ``CustomConfigConverter`` / ``ImportOnlyConfigConverter`` gain an ``hf_paths`` kwarg;
  every existing call site is audited and populated. ``IgnoredConfigConverter`` gains an
  ``hf_paths`` kwarg used for HF-only fields Fast-LLM intentionally does not consume
  (Mixtral router toggles, Qwen2 sliding-window machinery, Apriel2's default-injected
  ``embeddings`` subdict from ``Apriel2TextConfig``).
* ``HuggingfaceStateDictCheckpointHandler`` runs the check from ``_import_config`` against
  the base-model converter. A class-level allowlist covers transformers' generic
  ``PretrainedConfig`` fields and inference-only metadata that's always permitted. The
  ``Apriel2`` text handler's override is updated to call the shared ``_check_hf_coverage``
  helper. Non-``ConfigSectionConverter`` base-model converters (Llava aggregators) skip the
  check transparently.
* ``LlamaBaseModelConverter``'s decoder Custom - which wraps the imperative
  ``LlamaDecoderConverter`` - auto-extends its ``hf_paths`` from the block converter's
  ``_consumed_hf_paths``, so Mistral/Mixtral/Qwen2/MTPLlama/Apriel inherit correct
  coverage. ``AprielBlockConverter`` (per-block-type dispatcher, also imperative) gets its
  own ``_consumed_hf_paths`` that unions across registered per-mixer block converters.
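
A toy sketch of the HF-side coverage walk: the flattening and prefix matching are illustrative only, and the real ``_consumed_hf_paths`` machinery differs in detail.

```python
_HF_METADATA_ALLOWLIST = {"architectures", "torch_dtype", "transformers_version"}


def flatten(d: dict, prefix: tuple[str, ...] = ()) -> list[tuple[str, ...]]:
    paths = []
    for key, value in d.items():
        path = prefix + (key,)
        paths.extend(flatten(value, path) if isinstance(value, dict) else [path])
    return paths


def check_hf_coverage(hf_config: dict, consumed_prefixes: set[tuple[str, ...]]) -> None:
    for path in flatten(hf_config):
        claimed = any(path[: len(p)] == p for p in consumed_prefixes)
        # At this point in the PR the allowlist only matched top-level keys.
        allowlisted = len(path) == 1 and path[0] in _HF_METADATA_ALLOWLIST
        if not (claimed or allowlisted):
            raise ValueError(f"unknown key {'.'.join(path)!r} not consumed by any declaration")


check_hf_coverage(
    {
        "num_attention_heads": 32,
        "rope_scaling": {"rope_type": "yarn", "factor": 8.0},
        "torch_dtype": "bfloat16",
    },
    consumed_prefixes={("num_attention_heads",), ("rope_scaling",)},
)
```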

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Apriel2BlockConverter._validate_export asserts type(config.mlp) is MLPConfig,
  restoring the pre-PR rejection of MoEMLPConfig that NestedConfigConverter
  would otherwise silently descend through (dropping experts/routing/router).
- _consumed_hf_paths now expands a nested sub-converter's claims under its
  hf_path prefix (NestedConfigConverter/DispatchConfigConverter with hf_path
  set) so check_hf_coverage descends and flags unknown keys deep inside
  apriel2's head/decoder, llava's vision_config, etc.
- Pin prediction_heads to 1 in Llama and Apriel2 head converters via
  ConstantImportConfigConverter so non-default values fail on export instead
  of silently dropping (MTP-Llama overrides the entry with Rename).
- Document the cache-mutation hazard on _create_config_converters: subclasses
  must spread the parent's dict, never mutate it in place.
- Narrow Apriel2BaseModelConverter's HF embeddings Ignored to the single
  injected leaf so future transformers fields in the same subdict trip the
  coverage check.
- Tighten Mixtral router Ignored comment to record the structural rationale
  (router.weight has no architecture sub-fields, so the blanket claim is
  equivalent to the narrowest possible non-recursive claim).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Surface cleanups from the fine-pass review: rename ``cur`` → ``current`` in
``_get_attr_path``, merge an unintentionally split f-string in the
``DispatchConfigConverter`` error path, switch bare ``return`` to ``pass`` in
empty ``-> None`` converter bodies, type-annotate ``_per_layer_bias_export``
and ``get_apriel2_decoder_converter`` (dropping a redundant forward-ref
quote), and replace ``<->`` with ``↔`` in the remaining converter docstrings
for consistency across the migration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jlamypoirier jlamypoirier changed the title from "Declarative checkpoint config conversion (Llama pilot)" to "Declarative checkpoint config conversion" on May 12, 2026
jlamypoirier and others added 9 commits May 11, 2026 20:34
Round 6 picks up one latent correctness bug, consolidates duplicated
declarations into framework primitives, and tidies several surface items.

* ``Apriel2HeadConverter._validate_export`` now asserts ``RMSNormalizationConfig``:
  the config side dispatches normalization through ``APRIEL2_NORM_REGISTRY`` while
  the weight side hardcoded RMS, so a LayerNorm/NoNorm head would have silently
  dropped its bias on convert.
* ``ConfigSectionConverter.import_config`` injects ``{"type": <dynamic_type_name>}``
  from ``fast_llm_config_class`` automatically, removing the redundant injection
  from ``NestedConfigConverter`` / ``TypedDictContainerConfigConverter`` and
  collapsing four hand-rolled overrides (Apriel mamba/gdn/kda + Mixtral moe).
* Deleted ``MTPLlamaDecoderConverter`` — its overrides were byte-identical to the
  parent's after the migration, with the only diff being a Pattern restriction
  that the parent now handles correctly through the multi-block-equality branch.
* Extracted ``_per_layer_bias_converter`` and ``_apriel2_conv_kernel_converter``
  helpers in apriel2.py to collapse pairs of byte-identical CustomConfigConverter
  declarations.
* ``AprielBlockConverter._consumed_hf_paths`` gets ``@functools.cache`` for parity
  with the base ``ConfigSectionConverter._consumed_hf_paths``.
* ``effective_bias`` typed as ``AffineLinearConfig``; ``NoPeftConfig`` import
  moved to top of llama.py (the module is not a config module subject to the
  heavy-import rule); stale ``# TODO: Peft?`` removed.
* CLAUDE.md naming convention clarified: single underscore covers non-public
  (private or protected), matching the project's actual usage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Apriel2VisionAttention/Block/MLP/Encoder/Embeddings/Adapter/Model and the
top-level Apriel2MultimodalBaseModelConverter become ConfigSectionConverter
subclasses. The vision branch keeps inheriting weight-side get_converters from
Pixtral/Llava bases via MRO; only the config side is declarative.

Cross-section rotary metadata (patch_size/max_image_size derived from
embeddings.patch_height) is injected at the vision-model level via a Custom,
which is the smallest scope that sees both halves.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
LlavaVisionAdapter/VisionModel/Base converters become ConfigSectionConverter
subclasses. The adapter is declared at the LlavaBaseModelConverter scope (not
inside VisionModelConverter) because its intermediate_size derives from
text_config.hidden_size — a cross-section reference reachable only at the
top-level HF dict.

PixtralAttentionConverter's head_size declaration changes from DefaultConfig
(emits head_dim) to ImportOnly (derives from hidden_size / num_attention_heads).
The previous head_dim popping in the imperative LlavaVisionModelConverter is
replaced by a head_size invariant check on the new declarative converter's
_validate_export.

Apriel2VisionAdapterConverter loses its MRO trick (ConfigSectionConverter +
LlavaVisionAdapterConverter) and inherits cleanly from Llava — now that Llava
is also a ConfigSectionConverter, the trick would produce an inconsistent MRO.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mistral/Qwen2/Mixtral DecoderConverter subclasses disappear; their BaseModelConverters
now plug in block_converter_class directly, and LlamaBaseModelConverter inlines the
Fixed/Pattern dispatch (config + weight sides) parameterised by that ClassVar.

LlamaDecoderConverter stays as an imperative helper for the cases that don't fit the
common pattern: Pixtral's vision encoder dispatch (Llava) and Apriel's per-position
hybrid layout dispatch. AprielBaseModelConverter overrides the "decoder" declaration
to delegate to AprielDecoderConverter (held via a new apriel_decoder_converter_class
ClassVar) instead of using the inlined Llama dispatch.

Qwen2BaseModelConverter.import_config (one-line MRoPE guard) becomes a declarative
ImportOnlyConfigConverter claiming use_mrope and asserting on import.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
transformers' PretrainedConfig.to_dict() populates _name_or_path/architectures/
torch_dtype/transformers_version on nested configs (vision_config is a
PretrainedConfig under transformers.LlavaConfig), so a round-tripped save
carries these keys back through the HF coverage check. The top-level
_HF_METADATA_ALLOWLIST only matches single-key paths, so we mark them
explicitly ignored inside LlavaVisionModelConverter.

Fixes test_conversion[llava] which failed on
"unknown key 'vision_config.architectures'".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The top-level _HF_METADATA_ALLOWLIST covers generic PretrainedConfig fields
(architectures, torch_dtype, transformers_version, output_hidden_states, …),
but the recursive coverage walker only matched it on single-key paths. After a
round-tripped save, transformers populates the same metadata on nested
sub-configs like Llava's vision_config, which then trip the walker.

Match the allowlist against any segment of the path. Revert the previous
local-scope ignore claim on LlavaVisionModelConverter, which only patched a
subset of the keys and didn't help apriel2 or future nested formats.
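
A before/after sketch of the allowlist match, using toy helper functions rather than the real walker:

```python
_HF_METADATA_ALLOWLIST = {"architectures", "torch_dtype", "transformers_version"}


def allowlisted_before(path: tuple[str, ...]) -> bool:
    # Old behaviour: only top-level metadata keys were accepted.
    return len(path) == 1 and path[0] in _HF_METADATA_ALLOWLIST


def allowlisted_after(path: tuple[str, ...]) -> bool:
    # New behaviour: metadata transformers re-populates on nested sub-configs
    # (e.g. vision_config.architectures after a round-trip save) is accepted too.
    return any(segment in _HF_METADATA_ALLOWLIST for segment in path)


path = ("vision_config", "architectures")
assert not allowlisted_before(path)
assert allowlisted_after(path)
```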

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
transformers.LlavaConfig.from_dict(...).save_pretrained(...) round-trips the
config through transformers.LlavaConfig and PixtralVisionConfig, which fill in
many model-specific defaults Fast-LLM doesn't consume (head_dim, image_size,
layer_norm_eps, initializer_factor, projection_dim, vocab_size in vision_config;
image_seq_length, tie_word_embeddings at the top level).

Add IgnoredConfigConverter claims for these so the recursive HF coverage check
accepts round-tripped saves. tie_word_embeddings is intentionally claimed only
at the top level — Fast-LLM tracks it inside text_config via Llama's
tied_embedding_weight declaration.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Brings in main's new architecture fields (Gemma 4 + stochastic-mixer
oversampling) and resolves the conflicts with the declarative converter
migration:

  AttentionConfig.{query,key,value}_norm, shared_key_value
  MLPConfig.{pre,post}_norm
  DecoderBlockConfig.{pre,post}_{mixer,mlp}_normalization, output_scale
  LanguageModelEmbeddingsConfig.embedding_scale
  LanguageModelHeadConfig.final_logit_softcap
  MoEMLPConfig.{router_normalization, router_scale, router_input_scale,
                router_per_expert_scale}

Each new field is claimed by ConstantImportConfigConverter (asserts the
Fast-LLM default on export, injects the default on import) on every
section converter whose fast_llm_config_class is the field's container.
OptionalParameterConfig fields (output_scale, router_scale,
router_per_expert_scale) are claimed via IgnoredConfigConverter plus a
``_validate_export`` assertion mirroring main's imperative ``not enabled``
check.

The Gemma 4 imperative converter (``conversion/gemma4.py``) is brought
in as-is, with its now-removed ``QueryWeightConverter`` import dropped
to keep the module importable; a follow-up commit ports it to the
declarative API.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Migrate the top-level Gemma 4 base-model converter from the imperative
import_config/export_config shape to the declarative
ConfigSectionConverter API used by the rest of the codebase. The
embeddings/decoder/head sections remain imperative helpers — each is
wrapped in a recursing CustomConfigConverter because Gemma 4's HF
format cross-references hidden_size (embeddings, MoE router scale) and
merges two block variants (sliding_attention / full_attention) into one
HF dict, neither of which fits the standard per-section decomposition.

The previously-imperative top-level guards (PLE, KV sharing, double-wide
MLP, bidirectional attention) become declarative
ConstantExportConfigConverter / CustomConfigConverter entries that
preserve the rejected-feature checks while running through the same
HF-coverage walker as every other format. ``vocab_size_per_layer_input``
is claimed via IgnoredConfigConverter so the coverage walker accepts
the value transformers fills on save_pretrained round-trip.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>