Declarative checkpoint config conversion #508

Open

jlamypoirier wants to merge 28 commits into main from jlp_simplify_conversion

Declarative checkpoint config conversion#508
jlamypoirier wants to merge 28 commits into
mainfrom
jlp_simplify_conversion

Conversation

Collaborator

@jlamypoirier jlamypoirier commented May 5, 2026

Summary

Replaces the imperative import_config / export_config methods on checkpoint converter classes with declarative ConfigConverter primitives keyed off FieldHint.architecture. Two recursive coverage walkers — one over Fast-LLM architecture fields, one over the HF dict — guarantee that every architecture-significant field and every HF dict key is accounted for on import and export. Unsupported feature detection is driven from a single source instead of scattered per-converter asserts.

What's in

  • Primitives (fast_llm/engine/checkpoint/external.py): Rename, ConstantImport, ConstantExport, Optional, Default, Ignored, ImportOnly, Custom, plus the recursive Nested / Dispatch / TypedDictContainer. Each declares the Fast-LLM paths and HF paths it consumes so the walkers can verify coverage on both sides (a minimal stand-in sketch follows this list).
  • ConfigSectionConverter ABC: cached _create_config_converters per class (subclasses extend the parent's dict by re-declaring keys), _validate_export hook for format-specific cross-field invariants, and recursive check_architecture_coverage / check_hf_coverage.
  • Architecture-hint reclassification (eight fields): attention dense_layer and softmax_scale_power, MLP activation, MoE router, Llama3 / Yarn rotary scaling parameters, StochasticMixer.main_mixer_name, vision patch_height / patch_width, and PatternBlockSequenceConfig.blocks. These are what the coverage walkers now require to be consumed.
  • GPT-side migration: llama, mistral, qwen2 (including the MRoPE guard declaratively as an ImportOnlyConfigConverter), mtp_llama, mixtral, diffusion_dream, diffusion_llama, apriel2 (text), the apriel hybrid-SSM mixers (Mamba, GatedDeltaNet, KimiDeltaAttention), and gemma4 — the latter as a declarative shell over imperative cross-block helpers (Gemma 4's HF format merges sliding + full attention into one config and cross-references hidden_size).
  • Multimodal migration: Llava (vision adapter, vision model, base model) and multimodal Apriel2 (vision attention/block/MLP/encoder/embeddings/adapter/model and the top-level multimodal base) are now ConfigSectionConverters. The Llava adapter is declared at the base-model scope so its intermediate_size import can cross-reference text_config.hidden_size. Apriel2 vision injects patch_size / max_image_size rotary metadata at the vision-model scope (the smallest scope that sees both embeddings.patch_height and the encoder rotary subtree).
  • Decoder dispatcher flattening: Mistral / Qwen2 / Mixtral DecoderConverter subclasses deleted. LlamaBaseModelConverter inlines the Fixed/Pattern dispatch parameterised by block_converter_class. The imperative LlamaDecoderConverter helper class survives because Pixtral's vision encoder dispatch and Apriel's per-position hybrid_block_layout dispatch don't fit the inlined common case; AprielBaseModelConverter overrides the "decoder" declaration to delegate to AprielDecoderConverter via a new apriel_decoder_converter_class ClassVar.
  • HF coverage allowlist propagation: check_hf_coverage now matches the allowlist against any segment of the walked path (not just top-level keys). transformers' generic PretrainedConfig metadata (architectures, torch_dtype, transformers_version, …) is now accepted at any depth — relevant for Llava's vision_config and similar nested sub-configs that transformers auto-populates on round-trip.
  • Static test (tests/models/test_converters.py): walks every registered handler's converter tree, runs check_architecture_coverage on each section node, and validates that every OptionalConfigConverter sentinel matches the field's resolved default (so an exported sentinel-equal value can't silently drift on re-import).
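
The declarative shape can be pictured roughly as follows. This is a minimal, self-contained sketch: the class names mirror the PR, but the signatures, dict-based configs, and field names are simplified assumptions, not the real fast_llm API.

```python
# Illustrative stand-ins only; not the real fast_llm converter classes.
import dataclasses


@dataclasses.dataclass
class RenameConfigConverter:
    """1:1 rename between a Fast-LLM field and an HF key."""
    fast_llm_name: str
    hf_name: str

    def export_to(self, fast_llm_config: dict, hf_config: dict) -> None:
        hf_config[self.hf_name] = fast_llm_config[self.fast_llm_name]

    def import_to(self, hf_config: dict, fast_llm_config: dict) -> None:
        fast_llm_config[self.fast_llm_name] = hf_config[self.hf_name]


@dataclasses.dataclass
class ConstantImportConfigConverter:
    """Assert the Fast-LLM default on export, inject it on import; writes no HF key."""
    fast_llm_name: str
    value: object

    def export_to(self, fast_llm_config: dict, hf_config: dict) -> None:
        assert fast_llm_config[self.fast_llm_name] == self.value

    def import_to(self, hf_config: dict, fast_llm_config: dict) -> None:
        fast_llm_config[self.fast_llm_name] = self.value


class AttentionSectionConverter:
    """Hypothetical section converter: a dict of declarations replaces the
    imperative import_config/export_config bodies."""

    @classmethod
    def _create_config_converters(cls) -> dict:
        return {
            "heads": RenameConfigConverter("heads", "num_attention_heads"),
            "head_groups": RenameConfigConverter("head_groups", "num_key_value_heads"),
            # Architecture field with no HF counterpart: pinned to the default.
            "softmax_scale_power": ConstantImportConfigConverter("softmax_scale_power", 0.5),
        }

    @classmethod
    def export_config(cls, fast_llm_config: dict) -> dict:
        hf_config: dict = {}
        for converter in cls._create_config_converters().values():
            converter.export_to(fast_llm_config, hf_config)
        return hf_config

    @classmethod
    def import_config(cls, hf_config: dict) -> dict:
        fast_llm_config: dict = {}
        for converter in cls._create_config_converters().values():
            converter.import_to(hf_config, fast_llm_config)
        return fast_llm_config


exported = AttentionSectionConverter.export_config(
    {"heads": 32, "head_groups": 8, "softmax_scale_power": 0.5}
)
assert exported == {"num_attention_heads": 32, "num_key_value_heads": 8}
assert AttentionSectionConverter.import_config(exported)["softmax_scale_power"] == 0.5
```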

Main merge

  • Brings in the new architecture fields landed since the branch cut: AttentionConfig.{query,key,value}_norm + shared_key_value, MLPConfig.{pre,post}_norm, DecoderBlockConfig.{pre,post}_{mixer,mlp}_normalization + output_scale, LanguageModelEmbeddingsConfig.embedding_scale, LanguageModelHeadConfig.final_logit_softcap, and MoEMLPConfig.{router_normalization, router_scale, router_input_scale, router_per_expert_scale}. Each new field is claimed on every section converter whose fast_llm_config_class is the field's container via ConstantImportConfigConverter (asserts the Fast-LLM default on export, injects the default on import); OptionalParameterConfig fields (output_scale, router_scale, router_per_expert_scale) are claimed via IgnoredConfigConverter plus a _validate_export "not enabled" assertion to preserve the imperative reject-non-default behaviour (see the sketch below).
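
A hedged sketch of that claim-and-reject pattern for OptionalParameterConfig fields; the config and converter classes below are toy stand-ins, not the real fast_llm ones.

```python
import dataclasses


@dataclasses.dataclass
class OptionalParameter:
    enabled: bool | None = None  # stand-in for OptionalParameterConfig


@dataclasses.dataclass
class BlockConfig:
    output_scale: OptionalParameter = dataclasses.field(default_factory=OptionalParameter)


class BlockSectionConverter:
    @classmethod
    def _validate_export(cls, config: BlockConfig) -> None:
        # The field is claimed by an IgnoredConfigConverter so the coverage walker
        # is satisfied, but a configured value still fails loudly on export instead
        # of being silently dropped from the HF dict.
        assert not config.output_scale.enabled, "output_scale not representable in this HF format"


BlockSectionConverter._validate_export(BlockConfig())  # default (not enabled): passes
```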

Notable shape decisions

  • Coverage check is type-strict (type(config) is cls.fast_llm_config_class). Strict subclasses defer to a more specific converter, which kept partially migrated callers working through super() during the multi-step migration.
  • HF-side coverage walker consumes prefixes registered by every declaration, descending through Nested / Dispatch sub-converters under their hf_path. IgnoredConfigConverter(hf_paths=...) is the explicit opt-out for HF-only fields with no Fast-LLM counterpart (Mixtral router-runtime toggles, Qwen2 sliding-window, …) and for the model-specific defaults transformers' Pixtral/Llava fill on round-trip (head_dim, image_size, layer_norm_eps, projection_dim, vocab_size, image_seq_length, tie_word_embeddings at the Llava level).
  • _create_config_converters is cached with @functools.cache. Subclasses must return a fresh dict ({**super()._create_config_converters(), ...}) — mutating the parent's returned dict would corrupt its cache entry (illustrated in the sketch after this list). Documented on the base method.
  • IgnoredConfigConverter is recursive (recurses=True). Used for sub-configs with no architecture leaves (ParameterConfig, Mixtral's router sub-config) and for HF-only fields. Non-architecture fields (lr_scale, apply_peft, initialization sub-config) are by design not part of the HF round-trip; Fast-LLM keeps them on the in-memory config independently.
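
A self-contained illustration of that cache hazard, using toy classes with stand-in values rather than real converter declarations:

```python
import functools


class Base:
    @classmethod
    @functools.cache
    def _create_config_converters(cls) -> dict:
        # Cached: every call for the same cls returns the *same* dict object.
        return {"heads": "rename", "activation": "constant"}


class Child(Base):
    @classmethod
    @functools.cache
    def _create_config_converters(cls) -> dict:
        # Correct: spread into a fresh dict; never mutate the object super() returns.
        return {**super()._create_config_converters(), "activation": "rename"}


# The cached dict is a shared object: in-place mutation by any caller leaks into
# every later call for the same class, which is why subclasses must spread, not mutate.
shared = Base._create_config_converters()
shared["activation"] = "mutated"
assert Base._create_config_converters()["activation"] == "mutated"
assert Child._create_config_converters()["activation"] == "rename"
```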

What's deferred

  • Weight-converter declarative refactor. The weight side still uses today's WeightConverter subclasses (SplitWeightConverter, MLPLayer2Converter, KeyValueWeightConverter, …) and the per-converter get_converters classmethods.
  • Gemma 4 / Apriel block-level declarative conversion. Gemma 4's Gemma4BlockConverter / Gemma4DecoderConverter and Apriel's AprielBlockConverter / AprielDecoderConverter remain imperative — Gemma 4 because the HF format merges sliding + full attention into one config and cross-references hidden_size (doesn't fit per-section decomposition), Apriel because hybrid_block_layout is a positional list discriminator (would need a new ListDispatchConfigConverter primitive).
  • A handful of structural follow-ups recorded for future review rounds: IgnoredConfigConverter as a _ignored_fields ClassVar, get_converters signature uniformity, LlamaDecoderConverter as a ConfigSectionConverter, IgnoredConfigConverter default-round-trip maintenance test.

Test plan

  • pytest -v -n 8 tests/models/test_checkpoint.py tests/models/test_hf_roundtrip.py tests/models/test_converters.py: 271 passed, no failures (gemma4 dependency group now included).
  • pytest -v -n 8 fast_llm_external_models/tests/: 2109 passed, 42 skipped (separate invocation per CLAUDE.md).
  • Manual smoke: fast-llm convert --input.format <fmt> --input.path <ref> --output.format <fmt> --output.path <tmp>; reload both and compare configs.

🤖 Generated with Claude Code

jlamypoirier and others added 3 commits May 5, 2026 18:33
Eight config fields whose values directly affect model architecture were
previously hinted as feature/core/(none) and are now tagged
FieldHint.architecture. They drive the upcoming declarative-converter
coverage check, which uses FieldHint.architecture as the source of truth
for "must be handled by every checkpoint format". (A toy sketch of the
tagging follows the field list below.)

- AttentionConfig.dense_layer (output projection presence)
- AttentionConfig.softmax_scale_power (attention scaling)
- MLPConfig.activation (forward-pass activation type)
- MoEMLPConfig.router (routing weights drive token assignment)
- Llama3RotaryConfig: scale_factor, low_frequency_factor,
  high_frequency_factor, original_context_length
- YarnRotaryConfig: scale_factor, attention_factor, beta_fast, beta_slow,
  original_context_length
- StochasticMixerConfig.main_mixer_name (selects inference mixer)
- PatchEmbeddingsConfig.patch_height/patch_width (input tokenization)
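
A toy sketch of what the tagging buys the coverage check. Field and FieldHint here are simplified stand-ins, not the real fast_llm classes; only the idea of architecture-hinted leaves is taken from the commit.

```python
import enum
from dataclasses import dataclass, field, fields


class FieldHint(enum.Enum):
    architecture = enum.auto()  # must be claimed by every checkpoint format
    feature = enum.auto()


@dataclass
class AttentionConfig:
    heads: int = field(default=32, metadata={"hint": FieldHint.architecture})
    softmax_scale_power: float = field(default=0.5, metadata={"hint": FieldHint.architecture})
    dropout: float = field(default=0.0, metadata={"hint": FieldHint.feature})


def architecture_fields(config_cls: type) -> set[str]:
    """The set of field names a section converter must account for."""
    return {
        f.name
        for f in fields(config_cls)
        if f.metadata.get("hint") is FieldHint.architecture
    }


assert architecture_fields(AttentionConfig) == {"heads", "softmax_scale_power"}
```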

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reintroduces the declarative config-conversion shape that pre-dated PR #362,
applied within the post-#362 modular per-section structure. Replaces the
imperative import_config/export_config bodies with a small set of named
primitives and a recursive walker driven by per-section declarations.

Primitives in fast_llm.engine.checkpoint.external:
- RenameConfigConverter — 1:1 path rename
- ConstantExportConfigConverter — write constant on export, assert on import
- ConstantImportConfigConverter — assert on export, inject on import
- DefaultConfigConverter — rename with HF-side fallback
- OptionalConfigConverter — emit/import only when non-sentinel
- IgnoredConfigConverter — declare a field as intentionally not converted
- CustomConfigConverter — escape hatch for cross-field transforms
- NestedConfigConverter — recurse into a fixed-typed sub-config; flat-merges
  HF output into the parent (transformer side is assumed flat)
- DispatchConfigConverter — runtime type dispatch for polymorphic sub-configs

ConfigSectionConverter is the per-Fast-LLM-class converter base. Subclasses
declare their conversion via _create_config_converters() and inherit
import_config/export_config concretely. The architecture-coverage check fires
only when type(config) exactly matches the converter's declared
fast_llm_config_class — strict subclass types defer to a more specific
converter, allowing yet-to-be-migrated subclasses (e.g., Mixtral on Llama)
to call super().export_config() without tripping the parent's check on
fields the parent doesn't know about.

The walker is implicit: NestedConfigConverter / DispatchConfigConverter
call the public import_config/export_config on the sub-converter class so
subclass overrides participate, rather than a private path that bypasses
them.
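
A toy sketch of that strict-type gate (stand-in classes; the real ConfigSectionConverter lives in fast_llm.engine.checkpoint.external):

```python
class MLPConfig:
    pass


class MoEMLPConfig(MLPConfig):
    pass


class MLPSectionConverter:
    fast_llm_config_class = MLPConfig

    @classmethod
    def export_config(cls, config) -> dict:
        # The coverage check fires only on an exact type match: a MoEMLPConfig
        # passed through super().export_config() defers to the more specific
        # (possibly not-yet-migrated) converter instead of failing on MoE-only
        # fields this converter doesn't know about.
        if type(config) is cls.fast_llm_config_class:
            cls._check_architecture_coverage(config)
        return {}

    @classmethod
    def _check_architecture_coverage(cls, config) -> None:
        pass  # walks architecture-hinted fields against the declarations


MLPSectionConverter.export_config(MLPConfig())     # exact match: checked
MLPSectionConverter.export_config(MoEMLPConfig())  # strict subclass: deferred
```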

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pilot of the new ConfigSectionConverter framework. Each Llama section
converter (Normalization/MLP/Attention/Block/Embeddings/Head/BaseModel)
now declares its conversion via _create_config_converters() instead of
imperative import_config/export_config bodies. Weight side is unchanged.

Notable shape decisions:
- LlamaDecoderConverter stays as a regular (imperative) class because
  Fixed/Pattern block-sequence dispatch doesn't lend itself to the
  declarative shape. LlamaBaseModelConverter wires it in via a small
  CustomConfigConverter; subclasses (Mistral, Qwen2, MTP-Llama, ...)
  continue to plug in different block converters via block_converter_class.
- _check_config is retained as an overridable classmethod and called from
  the linear_layers CustomConfigConverter, so Qwen2 can keep its
  asymmetric Q/K/V bias rule without re-implementing the export.
- IgnoredConfigConverter is used for ParameterConfig sub-fields with no
  architecture-significant content (weight, output_weight, word_embeddings),
  and for prediction_heads (which Llama HF doesn't expose; subclass
  MTP-Llama adds it imperatively).
- peft uses CustomConfigConverter to assert NoPeftConfig on export. Llama
  HF format cannot represent PEFT, so a configured LoRA now fails loudly
  rather than being silently dropped.
- Rotary remains in CustomConfigConverter — the v4/v5 transformers split
  (rope_theta/rope_scaling vs. rope_parameters) and three rope_type
  variants don't fit pure rename primitives.

Verified with live round-trips of Llama-3, Qwen2, Mistral, Mixtral, and
MTP-Llama HF configs, plus tests/models/test_checkpoint.py for all GPT
formats (139 passed, 0 failed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jlamypoirier jlamypoirier force-pushed the jlp_simplify_conversion branch from 5567a71 to 0c406db on May 5, 2026 22:33
jlamypoirier and others added 16 commits May 6, 2026 07:14
Adds `_validate_export(cls, config)` classmethod hook on `ConfigSectionConverter`,
called automatically from `export_config` after the architecture-coverage check.
Replaces five `CustomConfigConverter`-as-validator blocks (`linear_layers`/`layers`
in attention and MLP, `position_embeddings` in embeddings, `peft` in base model,
plus the `_check_config` chain on attention) with `IgnoredConfigConverter` for
field-claiming + small `_validate_export` overrides. Mistral and Qwen2 rename
their `_check_config` overrides accordingly; Pixtral's imperative export updates
its `cls._check_config(config)` call site.

Also addresses several reviewer-flagged correctness/cleanup items:

- Drop the half-removed `parent_context` parameter from every primitive's
  `import_to` signature (and from `CustomConfigConverter`'s `import_fn`). It was
  unreachable through the walker.
- `_check_architecture_coverage` now reads `cls.fast_llm_config_class` directly
  instead of `getattr(..., None)`, surfacing missing class-attribute declarations
  as `AttributeError` rather than silently disabling the safety net.
- Drop the unused `hf_paths` parameter from `CustomConfigConverter.__init__`. There
  is no symmetric HF-side coverage check yet, so the field was cosmetic.
- Add a TODO note in `_check_architecture_coverage` documenting that the
  `MoEMLPConfig`/`MambaConfig`/etc. safety net is gated on later migrations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dict of named per-block configs is unambiguously architecture
metadata; without an explicit hint it defaulted to `unknown`, hiding
it from the architecture-coverage check used by declarative checkpoint
converters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two additions, both required by Apriel2's nested HF schema:

- `NestedConfigConverter` gains an optional `hf_path` kwarg. When set,
  the sub-converter's output is placed under that nested key instead
  of being flat-merged. Existing flat-merge behavior is unchanged when
  `hf_path` is omitted.
- New `TypedDictContainerConfigConverter` for `dict[str, Config]`
  fields where each entry is round-tripped through a per-class
  section converter. Polymorphic dispatch via the entry's runtime
  type on export and the HF discriminator on import. A homogeneous
  mode (single registered class with `hf_type_name = None`) skips
  the discriminator entirely.

Both `DispatchConfigConverter` and `TypedDictContainerConfigConverter`
now also inject the Fast-LLM `dynamic_type_name` discriminator into
the imported sub-dict so the parent's `from_dict` dispatches to the
right `Config` subclass without a separate ConstantImport.
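
A toy sketch of the typed-dict export path described above, with simplified configs and per-class converters; the real TypedDictContainerConfigConverter API is assumed, not copied.

```python
import dataclasses


@dataclasses.dataclass
class AttentionConfig:
    heads: int


@dataclasses.dataclass
class MambaConfig:
    state_size: int


class AttentionMixerConverter:
    fast_llm_config_class = AttentionConfig
    hf_type_name = "attention"

    @classmethod
    def export_config(cls, config: AttentionConfig) -> dict:
        return {"heads": config.heads}


class MambaMixerConverter:
    fast_llm_config_class = MambaConfig
    hf_type_name = "mamba"

    @classmethod
    def export_config(cls, config: MambaConfig) -> dict:
        return {"state_size": config.state_size}


MIXER_CONVERTERS = [AttentionMixerConverter, MambaMixerConverter]


def export_mixer_dict(mixers: dict[str, object]) -> dict[str, dict]:
    # Dispatch on each entry's runtime type; inject the HF "type" discriminator
    # so import can dispatch back to the matching converter.
    out = {}
    for name, mixer in mixers.items():
        converter = next(
            c for c in MIXER_CONVERTERS if type(mixer) is c.fast_llm_config_class
        )
        out[name] = {"type": converter.hf_type_name, **converter.export_config(mixer)}
    return out


print(export_mixer_dict({"attn": AttentionConfig(heads=16), "ssm": MambaConfig(state_size=64)}))
```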

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stress-tests the framework's polymorphic dispatch and typed-dict
support: Apriel2's HF schema is nested (`decoder.block.mixer.{...}`,
`head.normalization`, `mixers.{name}`) and the mixer field is
heterogeneously polymorphic (Attention/Mamba/StochasticMixer/GDN/KDA).

Migrated converters: per-mixer (Attention/Mamba/GDN/KDA), the
StochasticMixer container (driven by TypedDictContainer over a
leaf-mixer registry), per-normalization (RMS/LayerNorm/NoNorm), MLP,
Block, Fixed/Pattern decoder variants (selected by Dispatch on
runtime BlockSequenceConfig type), Head, and BaseModel.

The imperative weight-side `get_converters` methods are preserved
unchanged so the multimodal Apriel2 converter (which inherits from
the text-only one) keeps working without modification.

PatternDecoder's `blocks` dict uses the homogeneous mode of
TypedDictContainer (single-class registry, no discriminator). The
attention rotary-type translation (default ↔ mistral_1d) and Mamba's
auxiliary HF fields (d_conv, conv_bias, dt_proj_bias derived from
linear-config bias flags) remain on `CustomConfigConverter` since
they're shape-changing transforms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…primitives

Each format inherits Llama's `_create_config_converters` and replaces only the
fields that diverge:
  * Mistral: ConstantImportConfigConverter pinning `add_linear_biases=False` for
    attention and MLP (HF format has no `attention_bias`/`mlp_bias`); rename
    `window_size` <-> `sliding_window`.
  * Qwen2: ConstantImportConfigConverter for `add_linear_biases`; CustomConfigConverter
    for `head_size` (no `head_dim` HF field, derive on import); CustomConfigConverter
    for per-layer biases (always Q/K/V=True, dense=False); the head_dim relationship
    `heads * head_size == hidden_size` moves to `_validate_export` on the base-model
    converter; the use_mrope guard moves to `import_config`.
  * MTP-Llama: RenameConfigConverter for `prediction_heads` (Llama blanket-ignores it).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`MixtralMLPConverter` switches its `fast_llm_config_class` to `MoEMLPConfig` so the
architecture-coverage check sees MoE-specific fields. The config-side overrides:
  * `add_linear_biases` -> ConstantImportConfigConverter (Mixtral has no `mlp_bias`).
  * `experts` <-> `num_local_experts` and `experts_per_token` <-> `num_experts_per_tok`
    via RenameConfigConverter.
  * `shared_experts=0` and `routing=topk` pinned via ConstantImportConfigConverter so
    they round-trip cleanly without an HF representation.
  * `router` covered by IgnoredConfigConverter (Mixtral's gate is a default `LinearConfig`).
The Fast-LLM dynamic-type discriminator (`type: "moe"`) is injected via an `import_config`
override since the MLP is wrapped via `NestedConfigConverter` rather than `DispatchConfigConverter`.

Diffusion-Dream and Diffusion-Llama need no migration: they only override `architecture`,
`get_transformers_configuration_class`, and `_export_config` (auto_map). They inherit the
declarative converters from their parents (Qwen2 and Llama).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…itives

`AprielMambaConverter`, `GatedDeltaNetConverter`, and `KimiDeltaAttentionConverter` become
`ConfigSectionConverter` subclasses with their HF-side fields nested under the appropriate
HF subkey (`ssm_cfg` for Mamba, `linear_attn_config` for GDN/KDA).

Mamba's three sibling-default fields (`d_inner`, `d_xb`, `dt_rank`) read the HF root's
`hidden_size` directly via `DefaultConfigConverter.hf_default_fn` / `CustomConfigConverter`,
removing the need for an explicit `parent_context` plumbing through the framework. The
per-layer convolution and dt biases use `CustomConfigConverter` to pick up the mixer-wide
`add_linear_biases` fallback when unset; the existing `_check_config` per-layer assertions
move to `_validate_export`.

`AprielBlockConverter` (the per-block dispatcher) and `AprielDecoderConverter` (the
`hybrid_block_layout` driver) stay imperative because Apriel's HF format encodes the
mixer type in a parent-level list rather than a per-block discriminator, which
`DispatchConfigConverter` doesn't model. The `type: "mamba"`/`"gdn"`/`"kda"` Fast-LLM
discriminator is injected via a one-line `import_config` override on each leaf converter
(same pattern Mixtral uses).

The HF format has no test coverage in `tests/models/test_checkpoint.py` or
`tests/models/test_hf_roundtrip.py`, so verification was a synthesized live round-trip
covering each mixer leaf plus a hybrid attention+Mamba pattern decoder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…larative primitives

`PixtralNormalizationConverter` collapses to a single `_create_config_converters` override
that pins `epsilon=1e-5` via `ConstantImportConfigConverter` (asserts on export, injects
on import; no HF write). `PixtralEmbeddingsConverter` becomes a `ConfigSectionConverter`
with declarations for `patch_height` (rename to `patch_size`), `patch_width` (mirror
`patch_size` on import), `num_channels` (export-only constant 3), nested `normalization`,
and an `IgnoredConfigConverter` for `patch_embeddings`. The `patch_height == patch_width`
and `patch_embeddings.bias.enabled in (None, False)` checks move to `_validate_export`.

The remaining Llava and Apriel2 multimodal converters stay imperative: they're cross-section
aggregators (vision_config + text_config + top-level merge) whose shape doesn't fit a single
ConfigSectionConverter, often with parent-context dependencies (e.g., the adapter's
intermediate_size derives from the text model's hidden_size).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`CopyWeightConverter` was defined in `external.py` but never instantiated; deleted.
`QueryWeightConverter` was a no-op identity (its `export_weight`/`import_weight` just
unwrap and rewrap); replaced with the default `WeightConverter` at all three call
sites (Llama, Qwen2, Apriel2 attention) and removed the redundant `config` arg.

The broader weight-side refactor (declarative `WeightConverter` primitives, walker-driven
`drop_on_export` removal) is deferred — out of scope for this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Fix asymmetric round-trip in `Apriel2MambaConverter`: the `aux` declaration's import_fn
  now reads `d_conv` / `conv_bias` / `dt_proj_bias` back into `convolution_layer.kernel_size`,
  `convolution_layer.bias.enabled`, and `dt_layer.bias.enabled`. Previously these HF fields
  were dropped on import, which silently masked HF conv1d/dt_proj bias weights when they
  diverged from the mixer-wide `add_linear_biases` flag (parallel to the apriel.py mamba
  migration earlier in this PR).
- Drop the stale TODO from `_check_architecture_coverage`'s docstring (the migrations it
  referred to have all landed in this PR); reword the surrounding comment to describe
  the current strict-subtype handling.
- Combine adjacent f-strings in `DispatchConfigConverter`'s import-error message.
- Hoist `StochasticMixerSamplingStrategy` to the module-level import in `apriel2.py`;
  it was being re-imported on every `_create_config_converters` call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Recursive architecture-coverage walker (item 1): the section-level check now
  collects every architecture-hint path under the active config tree and matches
  each against the declarations. Recursive primitives
  (Nested/Dispatch/TypedDictContainer/Ignored, plus Custom/ImportOnly when the
  author opts in) cover whole subtrees by prefix; non-recursive ones must list
  every leaf they consume. Fixes the silent-drop class of bug previously masked
  for any sub-config field claimed by a flat CustomConfigConverter.

- Apriel2 rotary export bug fix (motivating leak for item 1): the export now
  emits the Llama3/Yarn scale parameters that round-trip via the pass-through
  import, instead of silently dropping them.

- Pixtral attention migrated to declarative form (item 3): _create_config_converters
  overrides instead of an imperative export_config that bypassed the coverage
  check.

- Apriel2 weight side cleanup (items 5, 6, 12): Apriel2MLPConverter owns its
  weight converters and the block delegates; the imperative Apriel2DecoderConverter
  is gone, replaced by per-shape get_converters on
  Apriel2FixedDecoderConverter / Apriel2PatternDecoderConverter dispatched via
  APRIEL2_DECODER_REGISTRY.

- ImportOnlyConfigConverter primitive (item 11) collapses three asymmetric
  CustomConfigConverter sites in qwen2.py and llava.py.

- Helper consolidation: drop external.py's _get_nested/_has_nested in favour of
  fast_llm.config.get_nested_dict_value (item 7); share assert_no_peft between
  Llama and Apriel2 base-model converters (item 10).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Subtree drops are now visible at the declaration site (review item 1).
  Five Custom converters previously claimed a sub-config subtree via
  ``recurses=True`` while only round-tripping a fraction of its architecture
  leaves; each is now non-recursive (lists every leaf it actually round-trips)
  with sibling ``IgnoredConfigConverter`` entries for the leaves the format
  drops on purpose. Sites: Apriel mamba ``convolution_layer`` and ``dt_layer``,
  Apriel2 GDN ``convolution_layer``, Apriel2 KDA ``convolution_layer`` and
  ``normalization``.

- Architecture-coverage walker now descends into ``dict[str, Config]`` and
  list/tuple-of-Config fields (item 2). Previously masked by
  ``TypedDictContainerConfigConverter.recurses=True``; the walker now matches
  what the docstring claims.

- Coverage error gains a hint when missing paths share a top-level prefix that
  is claimed non-recursively (item 3 — message half only): suggests
  Nested/Dispatch or ``recurses=True`` on Custom/ImportOnly. No new ``recurses``
  kwarg on the base primitives.

- Single ``effective_bias(layer_config, default)`` helper in llama.py replaces
  three near-duplicates (item 4): ``_resolve_bias_enabled`` in apriel.py,
  ``_get_effective_bias`` in apriel2.py, and the inline ternary in
  ``Apriel2MLPConverter`` (a hedged reconstruction follows this list).

- Apriel2 decoder dispatch lookup lifted into module-level
  ``get_apriel2_decoder_converter(decoder)`` (item 6); used by both the text
  and multimodal base-model converters.
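
A hedged reconstruction of the consolidated helper: the signature comes from the commit message, while the body and the config stand-ins are assumptions about the shared "per-layer bias flag with a mixer-wide fallback" logic.

```python
import dataclasses


@dataclasses.dataclass
class BiasConfig:
    enabled: bool | None = None  # None means "inherit the surrounding default"


@dataclasses.dataclass
class AffineLinearConfig:
    bias: BiasConfig = dataclasses.field(default_factory=BiasConfig)


def effective_bias(layer_config: AffineLinearConfig, default: bool) -> bool:
    """Resolve a per-layer bias flag, falling back to the mixer-wide default."""
    return default if layer_config.bias.enabled is None else layer_config.bias.enabled


assert effective_bias(AffineLinearConfig(), default=True) is True
assert effective_bias(AffineLinearConfig(BiasConfig(enabled=False)), default=True) is False
```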

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Apriel2 decoder converters expose `block_converter_class` ClassVar so subclasses can swap the
  block converter, mirroring the LlamaDecoderConverter polymorphism pattern.
* `_create_config_converters` is memoized via `functools.cache` (keyed by cls), so per-class
  declarations are built once. Convert two `out = super(); out[k] = v` mutation patterns
  (qwen2, llava) to spread+new-dict so the cached parent dict is never mutated.
* `NestedConfigConverter` auto-injects the HF `type` discriminator from the target
  converter's `hf_type_name`, mirroring `DispatchConfigConverter`/`TypedDictContainer`.
  Drops a manual `ConstantExportConfigConverter` from `Apriel2MLPConverter`.
* Move architecture-coverage check to `tests/models/test_converters.py`, parametrized
  per-format. Walks each `HuggingfaceStateDictCheckpointHandler.base_model_converter_class`
  through the modular converter tree (Nested/Dispatch/TypedDict + `*_converter_class` ClassVars)
  and runs `check_architecture_coverage` on each `ConfigSectionConverter` node. The
  per-export runtime invocation is removed.
* Same test verifies `OptionalConfigConverter` sentinels match the resolved field default —
  catches silent round-trip drift if a Fast-LLM default changes (illustrated in the sketch
  below).
* Two latent bugs surfaced and fixed by the new test:
  * `apriel.py` GDN/KDA converters were missing `convolution_layer` architecture claims.
  * `Apriel2MambaConverter.d_xb`/`dt_rank` misused `OptionalConfigConverter`
    (sentinel=None on a non-Optional int) - converted to `RenameConfigConverter`.

Deferred to follow-up commit: HF-side coverage check on every import (item 10) - needs
`hf_paths` audit across ~20 Custom/ImportOnly call sites and a flat-merge-aware walker.
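
A self-contained illustration of the sentinel/default invariant the test enforces; the converter below is a toy stand-in for OptionalConfigConverter.

```python
import dataclasses


@dataclasses.dataclass
class OptionalConfigConverter:
    fast_llm_name: str
    hf_name: str
    sentinel: object  # value meaning "don't emit an HF key"

    def export_to(self, fast_llm_config: dict, hf_config: dict) -> None:
        value = fast_llm_config[self.fast_llm_name]
        if value != self.sentinel:
            hf_config[self.hf_name] = value

    def import_to(self, hf_config: dict, fast_llm_config: dict) -> None:
        fast_llm_config[self.fast_llm_name] = hf_config.get(self.hf_name, self.sentinel)


FIELD_DEFAULT = None
converter = OptionalConfigConverter("window_size", "sliding_window", sentinel=FIELD_DEFAULT)

# A sentinel-equal value is omitted on export, and import falls back to the
# sentinel. The round trip is only lossless while sentinel == field default;
# the static test asserts exactly that for every OptionalConfigConverter.
hf: dict = {}
converter.export_to({"window_size": FIELD_DEFAULT}, hf)
restored: dict = {}
converter.import_to(hf, restored)
assert restored["window_size"] == FIELD_DEFAULT
```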

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Symmetric counterpart to the architecture-coverage check (already a test). Walks the HF
config dict at the import boundary and raises on any key not consumed by some declaration
in the converter tree. Catches transformers-version drift, manual edits, and corrupted
configs at the point of import rather than as cryptic downstream failures. (A toy sketch
of the walk follows the list below.)

* ``ConfigConverter`` primitives gain a recursive ``_consumed_hf_paths`` walker. Nested/
  Dispatch/TypedDictContainer with a fixed ``hf_path`` claim it as a subtree prefix; their
  flat-merge variants (``hf_path=None``) pull the sub-converter's claims up to the current
  level so a parent's check sees them.
* ``CustomConfigConverter`` / ``ImportOnlyConfigConverter`` gain an ``hf_paths`` kwarg;
  every existing call site is audited and populated. ``IgnoredConfigConverter`` gains an
  ``hf_paths`` kwarg used for HF-only fields Fast-LLM intentionally does not consume
  (Mixtral router toggles, Qwen2 sliding-window machinery, Apriel2's default-injected
  ``embeddings`` subdict from ``Apriel2TextConfig``).
* ``HuggingfaceStateDictCheckpointHandler`` runs the check from ``_import_config`` against
  the base-model converter. A class-level allowlist covers transformers' generic
  ``PretrainedConfig`` fields and inference-only metadata that's always permitted. The
  ``Apriel2`` text handler's override is updated to call the shared ``_check_hf_coverage``
  helper. Non-``ConfigSectionConverter`` base-model converters (Llava aggregators) skip the
  check transparently.
* ``LlamaBaseModelConverter``'s decoder Custom - which wraps the imperative
  ``LlamaDecoderConverter`` - auto-extends its ``hf_paths`` from the block converter's
  ``_consumed_hf_paths``, so Mistral/Mixtral/Qwen2/MTPLlama/Apriel inherit correct
  coverage. ``AprielBlockConverter`` (per-block-type dispatcher, also imperative) gets its
  own ``_consumed_hf_paths`` that unions across registered per-mixer block converters.
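
A toy sketch of the HF-side coverage walk: the flattening and prefix matching are illustrative only, and the real ``_consumed_hf_paths`` machinery differs in detail.

```python
_HF_METADATA_ALLOWLIST = {"architectures", "torch_dtype", "transformers_version"}


def flatten(d: dict, prefix: tuple[str, ...] = ()) -> list[tuple[str, ...]]:
    paths = []
    for key, value in d.items():
        path = prefix + (key,)
        paths.extend(flatten(value, path) if isinstance(value, dict) else [path])
    return paths


def check_hf_coverage(hf_config: dict, consumed_prefixes: set[tuple[str, ...]]) -> None:
    for path in flatten(hf_config):
        claimed = any(path[: len(p)] == p for p in consumed_prefixes)
        # At this point in the PR the allowlist only matched top-level keys.
        allowlisted = len(path) == 1 and path[0] in _HF_METADATA_ALLOWLIST
        if not (claimed or allowlisted):
            raise ValueError(f"unknown key {'.'.join(path)!r} not consumed by any declaration")


check_hf_coverage(
    {
        "num_attention_heads": 32,
        "rope_scaling": {"rope_type": "yarn", "factor": 8.0},
        "torch_dtype": "bfloat16",
    },
    consumed_prefixes={("num_attention_heads",), ("rope_scaling",)},
)
```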

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Apriel2BlockConverter._validate_export asserts type(config.mlp) is MLPConfig,
  restoring the pre-PR rejection of MoEMLPConfig that NestedConfigConverter
  would otherwise silently descend through (dropping experts/routing/router).
- _consumed_hf_paths now expands a nested sub-converter's claims under its
  hf_path prefix (NestedConfigConverter/DispatchConfigConverter with hf_path
  set) so check_hf_coverage descends and flags unknown keys deep inside
  apriel2's head/decoder, llava's vision_config, etc.
- Pin prediction_heads to 1 in Llama and Apriel2 head converters via
  ConstantImportConfigConverter so non-default values fail on export instead
  of silently dropping (MTP-Llama overrides the entry with Rename).
- Document the cache-mutation hazard on _create_config_converters: subclasses
  must spread the parent's dict, never mutate it in place.
- Narrow Apriel2BaseModelConverter's HF embeddings Ignored to the single
  injected leaf so future transformers fields in the same subdict trip the
  coverage check.
- Tighten Mixtral router Ignored comment to record the structural rationale
  (router.weight has no architecture sub-fields, so the blanket claim is
  equivalent to the narrowest possible non-recursive claim).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Surface cleanups from the fine-pass review: rename ``cur`` → ``current`` in
``_get_attr_path``, merge an unintentionally split f-string in the
``DispatchConfigConverter`` error path, switch bare ``return`` to ``pass`` in
empty ``-> None`` converter bodies, type-annotate ``_per_layer_bias_export``
and ``get_apriel2_decoder_converter`` (dropping a redundant forward-ref
quote), and replace ``<->`` with ``↔`` in the remaining converter docstrings
for consistency across the migration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jlamypoirier jlamypoirier changed the title from "Declarative checkpoint config conversion (Llama pilot)" to "Declarative checkpoint config conversion" on May 12, 2026
jlamypoirier and others added 9 commits May 11, 2026 20:34
Round 6 picks up one latent correctness bug, consolidates duplicated
declarations into framework primitives, and tidies several surface items.

* ``Apriel2HeadConverter._validate_export`` now asserts ``RMSNormalizationConfig``:
  the config side dispatches normalization through ``APRIEL2_NORM_REGISTRY`` while
  the weight side hardcoded RMS, so a LayerNorm/NoNorm head would have silently
  dropped its bias on convert.
* ``ConfigSectionConverter.import_config`` injects ``{"type": <dynamic_type_name>}``
  from ``fast_llm_config_class`` automatically, removing the redundant injection
  from ``NestedConfigConverter`` / ``TypedDictContainerConfigConverter`` and
  collapsing four hand-rolled overrides (Apriel mamba/gdn/kda + Mixtral moe).
* Deleted ``MTPLlamaDecoderConverter`` — its overrides were byte-identical to the
  parent's after the migration, with the only diff being a Pattern restriction
  that the parent now handles correctly through the multi-block-equality branch.
* Extracted ``_per_layer_bias_converter`` and ``_apriel2_conv_kernel_converter``
  helpers in apriel2.py to collapse pairs of byte-identical CustomConfigConverter
  declarations.
* ``AprielBlockConverter._consumed_hf_paths`` gets ``@functools.cache`` for parity
  with the base ``ConfigSectionConverter._consumed_hf_paths``.
* ``effective_bias`` typed as ``AffineLinearConfig``; ``NoPeftConfig`` import
  moved to top of llama.py (the module is not a config module subject to the
  heavy-import rule); stale ``# TODO: Peft?`` removed.
* CLAUDE.md naming convention clarified: single underscore covers non-public
  (private or protected), matching the project's actual usage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Apriel2VisionAttention/Block/MLP/Encoder/Embeddings/Adapter/Model and the
top-level Apriel2MultimodalBaseModelConverter become ConfigSectionConverter
subclasses. The vision branch keeps inheriting weight-side get_converters from
Pixtral/Llava bases via MRO; only the config side is declarative.

Cross-section rotary metadata (patch_size/max_image_size derived from
embeddings.patch_height) is injected at the vision-model level via a Custom,
which is the smallest scope that sees both halves.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
LlavaVisionAdapter/VisionModel/Base converters become ConfigSectionConverter
subclasses. The adapter is declared at the LlavaBaseModelConverter scope (not
inside VisionModelConverter) because its intermediate_size derives from
text_config.hidden_size — a cross-section reference reachable only at the
top-level HF dict.

PixtralAttentionConverter's head_size declaration changes from DefaultConfig
(emits head_dim) to ImportOnly (derives from hidden_size / num_attention_heads).
The previous head_dim popping in the imperative LlavaVisionModelConverter is
replaced by a head_size invariant check on the new declarative converter's
_validate_export.

Apriel2VisionAdapterConverter loses its MRO trick (ConfigSectionConverter +
LlavaVisionAdapterConverter) and inherits cleanly from Llava — now that Llava
is also a ConfigSectionConverter, the trick would produce an inconsistent MRO.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mistral/Qwen2/Mixtral DecoderConverter subclasses disappear; their BaseModelConverters
now plug in block_converter_class directly, and LlamaBaseModelConverter inlines the
Fixed/Pattern dispatch (config + weight sides) parameterised by that ClassVar.

LlamaDecoderConverter stays as an imperative helper for the cases that don't fit the
common pattern: Pixtral's vision encoder dispatch (Llava) and Apriel's per-position
hybrid layout dispatch. AprielBaseModelConverter overrides the "decoder" declaration
to delegate to AprielDecoderConverter (held via a new apriel_decoder_converter_class
ClassVar) instead of using the inlined Llama dispatch.

Qwen2BaseModelConverter.import_config (one-line MRoPE guard) becomes a declarative
ImportOnlyConfigConverter claiming use_mrope and asserting on import.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
transformers' PretrainedConfig.to_dict() populates _name_or_path/architectures/
torch_dtype/transformers_version on nested configs (vision_config is a
PretrainedConfig under transformers.LlavaConfig), so a round-tripped save
carries these keys back through the HF coverage check. The top-level
_HF_METADATA_ALLOWLIST only matches single-key paths, so we mark them
explicitly ignored inside LlavaVisionModelConverter.

Fixes test_conversion[llava] which failed on
"unknown key 'vision_config.architectures'".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The top-level _HF_METADATA_ALLOWLIST covers generic PretrainedConfig fields
(architectures, torch_dtype, transformers_version, output_hidden_states, …),
but the recursive coverage walker only matched it on single-key paths. After a
round-tripped save, transformers populates the same metadata on nested
sub-configs like Llava's vision_config, which then trip the walker.

Match the allowlist against any segment of the path. Revert the previous
local-scope ignore claim on LlavaVisionModelConverter, which only patched a
subset of the keys and didn't help apriel2 or future nested formats.
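
A before/after sketch of the allowlist match, using toy helper functions rather than the real walker:

```python
_HF_METADATA_ALLOWLIST = {"architectures", "torch_dtype", "transformers_version"}


def allowlisted_before(path: tuple[str, ...]) -> bool:
    # Old behaviour: only top-level metadata keys were accepted.
    return len(path) == 1 and path[0] in _HF_METADATA_ALLOWLIST


def allowlisted_after(path: tuple[str, ...]) -> bool:
    # New behaviour: metadata transformers re-populates on nested sub-configs
    # (e.g. vision_config.architectures after a round-trip save) is accepted too.
    return any(segment in _HF_METADATA_ALLOWLIST for segment in path)


path = ("vision_config", "architectures")
assert not allowlisted_before(path)
assert allowlisted_after(path)
```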

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
transformers.LlavaConfig.from_dict(...).save_pretrained(...) round-trips the
config through transformers.LlavaConfig and PixtralVisionConfig, which fill in
many model-specific defaults Fast-LLM doesn't consume (head_dim, image_size,
layer_norm_eps, initializer_factor, projection_dim, vocab_size in vision_config;
image_seq_length, tie_word_embeddings at the top level).

Add IgnoredConfigConverter claims for these so the recursive HF coverage check
accepts round-tripped saves. tie_word_embeddings is intentionally claimed only
at the top level — Fast-LLM tracks it inside text_config via Llama's
tied_embedding_weight declaration.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Brings in main's new architecture fields (Gemma 4 + stochastic-mixer
oversampling) and resolves the conflicts with the declarative converter
migration:

  AttentionConfig.{query,key,value}_norm, shared_key_value
  MLPConfig.{pre,post}_norm
  DecoderBlockConfig.{pre,post}_{mixer,mlp}_normalization, output_scale
  LanguageModelEmbeddingsConfig.embedding_scale
  LanguageModelHeadConfig.final_logit_softcap
  MoEMLPConfig.{router_normalization, router_scale, router_input_scale,
                router_per_expert_scale}

Each new field is claimed by ConstantImportConfigConverter (asserts the
Fast-LLM default on export, injects the default on import) on every
section converter whose fast_llm_config_class is the field's container.
OptionalParameterConfig fields (output_scale, router_scale,
router_per_expert_scale) are claimed via IgnoredConfigConverter plus a
``_validate_export`` assertion mirroring main's imperative ``not enabled``
check.

The Gemma 4 imperative converter (``conversion/gemma4.py``) is brought
in as-is, with its now-removed ``QueryWeightConverter`` import dropped
to keep the module importable; a follow-up commit ports it to the
declarative API.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Migrate the top-level Gemma 4 base-model converter from the imperative
import_config/export_config shape to the declarative
ConfigSectionConverter API used by the rest of the codebase. The
embeddings/decoder/head sections remain imperative helpers — each is
wrapped in a recursing CustomConfigConverter because Gemma 4's HF
format cross-references hidden_size (embeddings, MoE router scale) and
merges two block variants (sliding_attention / full_attention) into one
HF dict, neither of which fits the standard per-section decomposition.

The previously-imperative top-level guards (PLE, KV sharing, double-wide
MLP, bidirectional attention) become declarative
ConstantExportConfigConverter / CustomConfigConverter entries that
preserve the rejected-feature checks while running through the same
HF-coverage walker as every other format. ``vocab_size_per_layer_input``
is claimed via IgnoredConfigConverter so the coverage walker accepts
the value transformers fills on save_pretrained round-trip.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>