add Mistral-small-3.1 and Pixtral vision support by yicycyc · Pull Request #4591 · PaddlePaddle/PaddleFormers

yicycyc · 2026-06-02T09:29:54Z

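This PR adds model support for Mistral-small-3.1 and Pixtral.

The implementation reproduces the corresponding Mistral-small-3.1 and Pixtral implementations in transformers.

The vision head of mistral-small-3.1 reuses Pixtral. In transformers, Pixtral is implemented only as a vision head; the Pixtral-12b model is assembled there in LLaVA style by combining the Pixtral vision head with the Mistral text model. Paddle lacks these prerequisites, so this PR implements the complete Mistral-small-3.1 model plus a port of the transformers Pixtral implementation.

All accuracy checks were run on a layer-reduced model.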

Main changes

Add the mistral3 model:
- Mistral3Config
- Mistral3Model
- Mistral3ForConditionalGeneration
- Mistral3ForCausalLM alias
- AOA conversion rules for HF safetensors / flex checkpoint
- Support for image-text forward passes and SFT training
Add the pixtral vision tower:
- PixtralVisionConfig
- PixtralVisionModel
- PixtralImageProcessor
- PixtralProcessor
Extend the Auto mappings:
- AutoConfig
- AutoModel
- AutoModelForCausalLM
- AutoModelForConditionalGeneration
- AutoProcessor
- AutoImageProcessor
Extend multimodal data processing:
- Pass image_sizes through the dataset/collate pipeline
- Support Mistral3/Pixtral image token expansion and image-text input construction
Update the model list, capability matrix, and model unit tests
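
As an illustration of the image token expansion step, here is a minimal sketch of how one image could be turned into a row-wise run of placeholder tokens. It is not taken from this PR: the function name `expand_image_tokens`, the token strings `[IMG]`, `[IMG_BREAK]`, `[IMG_END]`, and the defaults `patch_size=14` and `spatial_merge_size=2` are all assumptions chosen for the example, and the layout (one break token per patch row, with the final one replaced by an end token) is assumed to mirror the transformers Pixtral processor.

```python
import math


def expand_image_tokens(height, width, patch_size=14, spatial_merge_size=2,
                        image_token="[IMG]", break_token="[IMG_BREAK]",
                        end_token="[IMG_END]"):
    """Return the placeholder token sequence for one image (hypothetical helper).

    Assumed layout: each row of merged patches becomes `cols` image tokens
    followed by a break token; the last break token is replaced by an end token.
    """
    # Side length in pixels covered by one placeholder token (assumption).
    cell = patch_size * spatial_merge_size
    rows = math.ceil(height / cell)
    cols = math.ceil(width / cell)

    tokens = []
    for _ in range(rows):
        tokens.extend([image_token] * cols)
        tokens.append(break_token)
    # The final row ends the image instead of breaking to another row.
    tokens[-1] = end_token
    return tokens


tokens = expand_image_tokens(224, 224)
print(len(tokens), tokens.count("[IMG]"))
```

The image_sizes values carried through the dataset/collate pipeline would supply `height` and `width` per image here; the actual patch size and merge factor come from the model config rather than these defaults.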

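Forward alignment verification

Model: layer-reduced mistral-small-3.1

Both sides load exactly the same .npy inputs:

input_ids
attention_mask
pixel_values
image_sizes

The input sample is one 224x224 image plus the prompt:

Describe this image briefly.

The total token count is 256.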

Results

Precision	logits mean_diff	logits max_diff	Conclusion
FP32	`1.01e-06`	`1.72e-05`	Nearly zero difference
BF16	`0`	`0`	Identical; Top-1 token matches
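
The metrics in the table above can be sketched as follows. This is a dependency-free illustration, not the script used for the PR: `compare_logits` is a hypothetical helper, and it assumes that mean_diff and max_diff denote the mean and maximum of the element-wise absolute difference between the two frameworks' logits. In practice the same computation would run as array operations over the full logits tensor.

```python
def compare_logits(ref, test):
    """Compare two flat logit vectors (hypothetical helper).

    Returns (mean_abs_diff, max_abs_diff, top1_match), assuming the reported
    metrics are absolute differences.
    """
    if len(ref) != len(test) or not ref:
        raise ValueError("logit vectors must be non-empty and equally long")

    diffs = [abs(a - b) for a, b in zip(ref, test)]
    mean_diff = sum(diffs) / len(diffs)
    max_diff = max(diffs)

    # Top-1 check: both sides must pick the same highest-scoring token id.
    top1_ref = max(range(len(ref)), key=ref.__getitem__)
    top1_test = max(range(len(test)), key=test.__getitem__)
    return mean_diff, max_diff, top1_ref == top1_test


# Toy vectors for illustration only, not data from the PR.
mean_diff, max_diff, same_top1 = compare_logits([0.1, 2.0, -1.0], [0.1, 2.5, -1.0])
print(mean_diff, max_diff, same_top1)
```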

Generation alignment

Text-only generation

10 tokens generated by Transformers:
[117577, 115201, 83673, 64162, 107744, 111937, 11254, 111937, 119792, 62615]
10 tokens generated by PaddleFormers:
[117577, 115201, 83673, 64162, 107744, 111937, 11254, 111937, 119792, 62615]

Multimodal generation
Input: a 112x112 test image + the text prompt "Describe this image."

10 tokens generated by Transformers:
[64162, 18845, 14124, 16814, 5744, 31026, 33565, 34868, 14456, 61350]
10 tokens generated by PaddleFormers:
[64162, 18845, 14124, 16814, 5744, 31026, 33565, 34868, 14456, 61350]

Both sides match exactly.
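
A check like this one is easy to automate. The sketch below uses a hypothetical helper, `first_mismatch`, that reports the first position at which two generated token sequences diverge; it is applied to the text-only token ids listed above.

```python
def first_mismatch(a, b):
    """Return the index of the first differing token, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


transformers_ids = [117577, 115201, 83673, 64162, 107744,
                    111937, 11254, 111937, 119792, 62615]
paddleformers_ids = [117577, 115201, 83673, 64162, 107744,
                     111937, 11254, 111937, 119792, 62615]

print(first_mismatch(transformers_ids, paddleformers_ids))  # None
```

Reporting the index rather than a plain boolean makes it easier to tell an early divergence (likely a modeling bug) from a late one (more likely numerical drift).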

Training verification

1. Text

BF16 full-SFT on GSM8K. Paddle ran the full 300 steps on 4 cards with sharding stage3, and the loss curve is compared against Torch/ms-swift with ZeRO-3.

Shared settings:

max_seq_len = 512
global batch size = 4
max_steps = 300
learning_rate = 1e-5
warmup_steps = 0
weight_decay = 0
seed = 42
shuffle disabled
attention eager

Training results:

step	Paddle loss	Swift loss	diff (Paddle - Swift)
1	`13.359375`	`13.401804`	`-0.042429`
2	`9.484375`	`9.411279`	`0.073096`
10	`4.585938`	`4.784250`	`-0.198312`
20	`4.011719`	`4.166615`	`-0.154897`
50	`3.625000`	`3.814764`	`-0.189764`
100	`3.500000`	`2.874740`	`0.625260`
150	`3.425781`	`3.515725`	`-0.089944`
200	`3.250000`	`3.552930`	`-0.302930`
243	`3.207031`	`3.151531`	`0.055500`
300	`3.089844`	`3.604705`	`-0.514861`
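
The diff column can be recomputed from the two loss columns as Paddle loss minus Swift loss. The sketch below does that for the text-SFT rows above and also reports the mean absolute gap and the step with the largest gap; the summary statistics are an illustrative addition, not figures reported in the PR, and `loss_diffs` is a hypothetical helper.

```python
# (step, Paddle loss, Swift loss) copied from the text-SFT table above.
rows = [
    (1, 13.359375, 13.401804),
    (2, 9.484375, 9.411279),
    (10, 4.585938, 4.784250),
    (20, 4.011719, 4.166615),
    (50, 3.625000, 3.814764),
    (100, 3.500000, 2.874740),
    (150, 3.425781, 3.515725),
    (200, 3.250000, 3.552930),
    (243, 3.207031, 3.151531),
    (300, 3.089844, 3.604705),
]


def loss_diffs(rows):
    """Return [(step, paddle_loss - swift_loss), ...]."""
    return [(step, paddle - swift) for step, paddle, swift in rows]


diffs = loss_diffs(rows)
mean_abs = sum(abs(d) for _, d in diffs) / len(diffs)
worst_step, worst_diff = max(diffs, key=lambda item: abs(item[1]))

for step, d in diffs:
    print(f"{step}\t{d:+.6f}")
print(f"mean |diff| = {mean_abs:.6f}, largest gap at step {worst_step}")
```

Because the table samples single steps, a per-step gap reflects batch-level noise as well as framework differences; comparing the overall trend of the two curves is the more meaningful check.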

2. Multimodal

BF16 full-SFT on the dataset from https://github.com/PaddlePaddle/PaddleFormers/blob/develop/docs/zh/dataset_format.md#24-%E5%A4%9A%E6%A8%A1%E6%80%81%E6%8C%87%E4%BB%A4%E5%BE%AE%E8%B0%83sft%E6%95%B0%E6%8D%AE%E6%A0%BC%E5%BC%8F. Paddle ran the full 300 steps on 4 cards with sharding stage3, and the loss curve is compared against Torch/ms-swift with ZeRO-3.

max_seq_len / max_length = 4096
global batch size = 4
max_steps = 300
learning_rate = 1e-5
warmup_steps = 0
weight_decay = 0
shuffle disabled
attention eager

Training results:

step	Paddle loss	Swift loss	diff (Paddle - Swift)
1	`12.687500`	`12.662024`	`0.025476`
2	`9.625000`	`9.594838`	`0.030162`
3	`9.625000`	`9.894209`	`-0.269209`
10	`7.515625`	`7.706814`	`-0.191189`
20	`6.234375`	`6.151044`	`0.083331`
50	`1.898438`	`1.336239`	`0.562199`
100	`0.507553`	`0.097169`	`0.410384`
150	`0.284668`	`0.120010`	`0.164658`
200	`0.155029`	`0.010435`	`0.144594`
243	`0.042799`	`0.630276`	`-0.587477`
275	`0.810303`	`0.016147`	`0.794156`
300	`0.578127`	`0.103580`	`0.474548`

feat: add Mistral3 and Pixtral vision support

640e06d

yicycyc wants to merge 1 commit into PaddlePaddle:develop from yicycyc:feat/mistral3-pixtral-clean
