Diff kv head#8037
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #8037 +/- ##
==========================================
Coverage ? 67.41%
==========================================
Files ? 471
Lines ? 66413
Branches ? 10231
==========================================
Hits ? 44769
Misses ? 18788
Partials ? 2856
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-11 18:59:27
📋 Review Summary
PR overview: supports per-layer KV heads, a V head dim that differs from the K head dim, and per-layer attention backend/cache initialization.
Scope of changes: Config, Attention/AppendAttention, KVCache, GPUModelRunner, QKV/QKVG loader, PaddleFormers fallback.
Impact tags: [FDConfig] [KVCache] [OP] [Models] [Loader]
Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/model_executor/layers/attention/attention.py:97 | Per-layer KV heads are not propagated to the QKV projection/loader, so the QKV width disagrees with what the backend/cache expects |
| 🔴 Bug | fastdeploy/worker/gpu_model_runner.py:1752 | The attention launch buffer is sized by the maximum KV head count, which underestimates the tile buffer needed by layers with fewer KV heads |
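Status of previous findings
| Finding | Issue | Status |
|---|---|---|
| F1 | With unequal K/V widths, the use_output branch still takes the equal-width parsing path of append_attention_with_output | |
| F2 | Split Q/K/V loading under KV-replica TP still has a path that slices V by head_dim | |
| F3 | QKVGateParallelLinear still calls the undefined _get_kv_shard_id() | |
| F4 | The PaddleFormers QKV loader still treats K and V as equal width when v_head_dim != head_dim | |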
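📝 PR convention check
The title lacks an official tag, and the PR description is still the template placeholder with no meaningful content filled in.
Suggested title (can be copied as-is):
[KVCache] Support per-layer KV heads and V head dim
Suggested PR description (can be copied as-is):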
## Motivation
Support models in which different layers use different numbers of KV heads, as well as attention and KV cache layouts where the K and V head dims differ.
## Modifications
- Set a default `v_head_dim` on ModelConfig and sync the field into the PaddleFormers text_config.
- Attention/AppendAttention select the attention backend per layer and create the value cache shape based on `v_head_dim`.
- GPUModelRunner supports `num_key_value_heads_list`, initializing the attention backend and KV cache per layer.
- The QKV/QKVG linear loader adjusts the Q/K/V/Gate output width calculation, with related linear unit tests added.
- The append_attn template adds the `GROUP_SIZE=3` combination.
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
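- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Overall assessment
This round reviewed, in risk-priority order, how the per-layer KV heads change interacts with the projection, backend/cache, buffer allocation, and loader. The current implementation still has a mismatch between the QKV projection and the per-layer backend, and the attention buffer capacity is estimated in the wrong direction. Both issues directly affect correctness for models whose layers have different KV head counts, so they should be fixed before merging.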
    if num_key_value_heads is None:
        num_key_value_heads = fd_config.model_config.num_key_value_heads
    else:
        num_key_value_heads = num_key_value_heads[self.layer_id]
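🔴 Bug num_key_value_heads_list only changes the KV head count seen by Attention/backend, while the QKV projection/loader still derives the Q/K/V widths from the global model_config.num_key_value_heads.
In this branch, layer i's Attention.kv_num_heads uses num_key_value_heads_list[i], but QKVParallelLinear.__init__ still uses fd_config.model_config.num_key_value_heads, and the existing call sites (e.g. qwen3.py:66, gpt_oss.py:59) do not pass a per-layer kv_num_heads either. When a layer's list value differs from the global value, the qkv output width and weight slicing disagree with the KV head count that layer's cache/backend expects, so the subsequent Q/K/V split in do_rope / write_cache / append_attention will be wrong.
Suggestion: pass layer_id or the resolved kv_num_heads into the corresponding QKVParallelLinear / QKVGateParallelLinear, and make output_size, the loader offsets, and Attention.kv_num_heads all use the same layer's KV head count. Also add a loading and forward unit test whose list contains at least two different KV head counts.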
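To make the mismatch concrete, here is a minimal sketch of the width arithmetic. It is not FastDeploy code: `LayerShape`, `qkv_widths`, and `mismatched_layers` are hypothetical names introduced for illustration, tensor-parallel sharding is ignored, and the fused QKV output is assumed to be laid out as Q, then K, then V.

```python
from dataclasses import dataclass


@dataclass
class LayerShape:
    """Hypothetical per-layer attention shape (not a FastDeploy class)."""
    num_heads: int      # number of query heads
    kv_num_heads: int   # number of KV heads used by this layer
    head_dim: int       # Q/K head dim
    v_head_dim: int     # V head dim, which may differ from head_dim


def qkv_widths(shape: LayerShape) -> tuple:
    """Return (q_width, k_width, v_width) of a fused QKV projection output."""
    q_width = shape.num_heads * shape.head_dim
    k_width = shape.kv_num_heads * shape.head_dim
    v_width = shape.kv_num_heads * shape.v_head_dim
    return q_width, k_width, v_width


def mismatched_layers(num_heads, head_dim, v_head_dim,
                      global_kv_heads, kv_heads_per_layer):
    """List layer ids whose expected QKV width differs from a projection
    that was sized with the global KV head count."""
    projected = sum(qkv_widths(
        LayerShape(num_heads, global_kv_heads, head_dim, v_head_dim)))
    bad = []
    for layer_id, kv_heads in enumerate(kv_heads_per_layer):
        expected = sum(qkv_widths(
            LayerShape(num_heads, kv_heads, head_dim, v_head_dim)))
        if expected != projected:
            bad.append(layer_id)
    return bad
```

Under these assumptions, a projection sized with a global KV head count of 8 matches a layer whose list entry is 8 but not one whose entry is 4, which is the situation the suggested unit test should cover.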
      decoder_step_token_num=self.speculative_config.num_speculative_tokens + 1,
      num_heads=num_heads,
    - kv_num_heads=self.model_config.kv_num_heads,
    + kv_num_heads=max(kv_num_heads_per_layer),
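🔴 Bug Sizing the launch buffer by max(kv_num_heads_per_layer) underestimates the tile count for layers with fewer KV heads.
allocate_launch_related_buffer internally computes decoder_max_tile_size / encode_max_tile_size from group_size = num_heads // kv_num_heads, whereas each layer's actual call to get_block_shape_and_split_kv_block passes that backend's self.group_size. When the KV head list is not uniform, the layer with the fewest KV heads has the largest group_size. For example, with [8, 4], allocating with max=8 sizes the buffers for group_size=3, but the 4-head layer writes into the same buffers with group_size=6, risking out-of-bounds writes or truncation.
Suggestion: size the buffers by the largest group_size, i.e. use min(kv_num_heads_per_layer) (or compute max(num_heads // kv) explicitly) as the KV head count for buffer sizing, and add a buffer capacity test for a per-layer KV heads combination.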
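The direction of the error can be checked with a few lines of arithmetic. The sketch below is illustrative only: `max_group_size` and `kv_heads_for_buffer_sizing` are hypothetical helpers, not part of FastDeploy, and `num_heads = 24` is an assumption inferred from the example above (it is the value for which 8 KV heads give group_size=3 and 4 KV heads give group_size=6).

```python
def max_group_size(num_heads: int, kv_num_heads_per_layer: list) -> int:
    """Largest num_heads // kv_num_heads over all layers."""
    return max(num_heads // kv for kv in kv_num_heads_per_layer)


def kv_heads_for_buffer_sizing(kv_num_heads_per_layer: list) -> int:
    """KV head count to size shared buffers with: the smallest KV head
    count yields the largest group_size."""
    return min(kv_num_heads_per_layer)


num_heads = 24                      # assumed value, see the note above
kv_num_heads_per_layer = [8, 4]

# Sizing with max(...) covers only the smaller group size.
sized_with_max = num_heads // max(kv_num_heads_per_layer)
# Sizing with min(...) covers the largest group size any layer will use.
sized_with_min = num_heads // kv_heads_for_buffer_sizing(kv_num_heads_per_layer)

print(sized_with_max, sized_with_min, max_group_size(num_heads, kv_num_heads_per_layer))
```

Computing `max(num_heads // kv)` explicitly, as the comment also suggests, avoids relying on the reader remembering that fewer KV heads means a larger group.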