Diff kv head by chang-wenbin · Pull Request #8037 · PaddlePaddle/FastDeploy

chang-wenbin · 2026-06-10T13:13:16Z

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

codecov-commenter · 2026-06-10T13:55:42Z

Codecov Report

❌ Patch coverage is 71.42857% with 36 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@edc885d). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/model_executor/layers/linear.py	65.71%	11 Missing and 1 partial ⚠️
fastdeploy/cache_manager/v1/cache_controller.py	76.66%	6 Missing and 1 partial ⚠️
...l_executor/layers/attention/append_attn_backend.py	50.00%	5 Missing and 2 partials ⚠️
fastdeploy/worker/gpu_model_runner.py	84.84%	4 Missing and 1 partial ⚠️
...eploy/model_executor/layers/attention/attention.py	77.77%	1 Missing and 1 partial ⚠️
...astdeploy/model_executor/ops/triton_ops/do_rope.py	0.00%	2 Missing ⚠️
fastdeploy/config.py	50.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #8037   +/-   ##
==========================================
  Coverage           ?   67.41%           
==========================================
  Files              ?      471           
  Lines              ?    66413           
  Branches           ?    10231           
==========================================
  Hits               ?    44769           
  Misses             ?    18788           
  Partials           ?     2856

Flag	Coverage Δ
GPU	`77.41% <71.42%> (?)`
XPU	`6.98% <1.58%> (?)`


PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-06-11 18:59:27

📋 Review Summary

PR overview: Supports per-layer KV heads, a V head dim that differs from the Q/K head dim, and per-layer attention backend/cache initialization.
Scope of changes: Config, Attention/AppendAttention, KVCache, GPUModelRunner, QKV/QKVG loader, PaddleFormers fallback.
Affected tags: [FDConfig] [KVCache] [OP] [Models] [Loader]

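Issues

Severity	File	Summary
🔴 Bug	`fastdeploy/model_executor/layers/attention/attention.py:97`	Per-layer KV heads are not propagated to the QKV projection/loader, so the QKV width does not match what the backend/cache expects
🔴 Bug	`fastdeploy/worker/gpu_model_runner.py:1752`	The attention launch buffer is sized by the maximum KV head count, which underestimates the tile buffer needed by layers with fewer KV heads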

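Status of Previous Findings

Finding	Issue	Status
F1	With K/V of unequal width, the `use_output` branch still goes through the equal-width parsing path of `append_attention_with_output`	⚠️ Still present
F2	Split Q/K/V loading under KV replica TP still has a path that slices V by `head_dim`	⚠️ Still present
F3	`QKVGateParallelLinear` still calls the undefined `_get_kv_shard_id()`	⚠️ Still present
F4	The PaddleFormers QKV loader still treats K/V as equal width when `v_head_dim != head_dim`	⚠️ Still present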
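
Findings F2 and F4 both concern code that sizes V with the Q/K head dim. The following is a minimal sketch of why that matters, not FastDeploy's loader: the helper `qkv_slices` is hypothetical, tensor parallelism is ignored, and the head counts are made up. Only the dims (192 for Q/K, 128 for V) are taken from the commit message "support gqa qkdim=192 vdim=128".

```python
def qkv_slices(num_heads, kv_num_heads, head_dim, v_head_dim):
    """Return (start, end) column ranges of Q, K and V in a fused [Q | K | V] output.

    Hypothetical helper for illustration; it assumes a single device and a
    contiguous Q-then-K-then-V layout.
    """
    q_size = num_heads * head_dim
    k_size = kv_num_heads * head_dim
    v_size = kv_num_heads * v_head_dim  # V uses its own head dim
    q = (0, q_size)
    k = (q_size, q_size + k_size)
    v = (q_size + k_size, q_size + k_size + v_size)
    return q, k, v


# Assumed head counts; dims follow the "qkdim=192 vdim=128" commit message.
num_heads, kv_num_heads, head_dim, v_head_dim = 8, 2, 192, 128

q, k, v = qkv_slices(num_heads, kv_num_heads, head_dim, v_head_dim)
total_width = v[1]

# An equal-width assumption would size V with head_dim and overrun the tensor.
equal_width_v_end = k[1] + kv_num_heads * head_dim
overrun = equal_width_v_end - total_width
```

With these numbers the fused output is 2176 columns wide, while an equal-width split would try to read V up to column 2304, 128 columns past the end.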

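📝 PR Convention Check

The title is missing an official tag, and the PR description is still the template placeholder with no meaningful content filled in.

Suggested title (can be copied directly):

[KVCache] Support per-layer KV heads and V head dim

Suggested PR description (can be copied directly)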

## Motivation
Support models in which different layers use different numbers of KV heads, as well as attention and KV cache layouts where the K and V head dims differ.

## Modifications
- Sync the default value of `v_head_dim` into ModelConfig, and sync the field in the PaddleFormers text_config.
- Attention/AppendAttention select the attention backend per layer and build the value cache shape from `v_head_dim`.
- GPUModelRunner supports `num_key_value_heads_list` and initializes the attention backend and KV cache per layer.
- The QKV/QKVG linear loader adjusts the Q/K/V/Gate output width calculation, with added linear unit tests.
- The append_attn template adds the `GROUP_SIZE=3` combination.

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

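Overall Assessment

This round of review prioritized risk and covered how per-layer KV heads interact across the projection, backend/cache, buffer allocation, and the loader. The current implementation still leaves the QKV projection inconsistent with the per-layer backend, and the attention buffer capacity is estimated in the wrong direction. Both issues directly affect correctness for models whose layers have different KV head counts; fixing them before merging is recommended.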

PaddlePaddle-bot · 2026-06-11T11:02:09Z

+        if num_key_value_heads is None:
+            num_key_value_heads = fd_config.model_config.num_key_value_heads
+        else:
+            num_key_value_heads = num_key_value_heads[self.layer_id]


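🔴 Bug `num_key_value_heads_list` only changes the KV head count seen by Attention and the backend; the QKV projection/loader still derives the Q/K/V widths from the global `model_config.num_key_value_heads`.

In this branch, layer i's `Attention.kv_num_heads` uses `num_key_value_heads_list[i]`, but `QKVParallelLinear.__init__` still uses `fd_config.model_config.num_key_value_heads`, and the existing call sites (such as qwen3.py:66 and gpt_oss.py:59) do not pass a per-layer `kv_num_heads`. When a layer's list value differs from the global value, the qkv output width and weight slicing no longer match the KV head count that the layer's cache/backend expects, and the subsequent Q/K/V splits in do_rope / write_cache / append_attention will be wrong.

Suggestion: pass `layer_id` or the resolved `kv_num_heads` into the corresponding `QKVParallelLinear` / `QKVGateParallelLinear`, and have `output_size`, the loader offsets, and `Attention.kv_num_heads` all use the same layer's KV head count. Also add a loading and forward unit test whose list contains at least two different KV head counts.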
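
The width mismatch described in this comment can be illustrated with a small calculation. This is only a sketch under stated assumptions: `qkv_output_size` is a hypothetical helper rather than the `QKVParallelLinear` implementation, tensor parallelism is ignored, and the head counts and head dim are invented for the example.

```python
def qkv_output_size(num_heads, kv_num_heads, head_dim, v_head_dim=None):
    """Width of a fused [Q | K | V] projection output on a single device.

    Hypothetical helper for illustration only.
    """
    v_head_dim = head_dim if v_head_dim is None else v_head_dim
    q_size = num_heads * head_dim
    k_size = kv_num_heads * head_dim
    v_size = kv_num_heads * v_head_dim
    return q_size + k_size + v_size


# Assumed example values.
num_heads, head_dim = 24, 128
global_kv_heads = 8        # stands in for model_config.num_key_value_heads
kv_heads_list = [8, 4]     # stands in for num_key_value_heads_list

# Width the projection produces if it only sees the global KV head count.
projection_width = qkv_output_size(num_heads, global_kv_heads, head_dim)

# Width each layer's attention/cache expects from its own KV head count.
expected_widths = [qkv_output_size(num_heads, kv, head_dim) for kv in kv_heads_list]

mismatched_layers = [
    layer_id
    for layer_id, width in enumerate(expected_widths)
    if width != projection_width
]
```

Here the projection emits 5120 columns for every layer, while the 4-head layer expects 4096, so any Q/K/V split computed from the per-layer count would land on the wrong offsets.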

PaddlePaddle-bot · 2026-06-11T11:02:09Z

            decoder_step_token_num=self.speculative_config.num_speculative_tokens + 1,
            num_heads=num_heads,
-            kv_num_heads=self.model_config.kv_num_heads,
+            kv_num_heads=max(kv_num_heads_per_layer),


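🔴 Bug Sizing the launch buffer by `max(kv_num_heads_per_layer)` underestimates the tile count for layers with fewer KV heads.

`allocate_launch_related_buffer` internally computes `decoder_max_tile_size` / `encode_max_tile_size` from `group_size = num_heads // kv_num_heads`, whereas each layer's actual call to `get_block_shape_and_split_kv_block` passes that backend's `self.group_size`. When the KV heads list is not uniform, the layer with the fewest KV heads has the largest group_size. For example, with [8, 4], allocating with max=8 assumes group_size=3, but the 4-head layer writes into the same buffers with group_size=6, risking out-of-bounds writes or truncation.

Suggestion: allocate for the largest group_size, that is, use `min(kv_num_heads_per_layer)` (or compute `max(num_heads // kv)` explicitly) as the KV head count for buffer sizing, and add a buffer capacity test covering a mix of per-layer KV head counts.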
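
The direction of the sizing error can be checked with the numbers from this comment. The sketch below is not the `allocate_launch_related_buffer` code; the helper names are hypothetical, and `num_heads = 24` is an assumption inferred from the comment's example (8 KV heads giving group_size=3).

```python
def group_size(num_heads, kv_num_heads):
    """Query heads per KV head, as in group_size = num_heads // kv_num_heads."""
    return num_heads // kv_num_heads


def sizing_kv_heads(kv_num_heads_per_layer):
    """KV head count to size shared buffers with (hypothetical helper).

    The layer with the fewest KV heads has the largest group size, so the
    minimum is the conservative choice.
    """
    return min(kv_num_heads_per_layer)


num_heads = 24                    # assumed, consistent with group_size=3 at 8 KV heads
kv_num_heads_per_layer = [8, 4]   # the [8, 4] example from the review comment

per_layer_group_sizes = [group_size(num_heads, kv) for kv in kv_num_heads_per_layer]

# Sizing by the maximum KV head count yields the smallest group size.
sized_with_max = group_size(num_heads, max(kv_num_heads_per_layer))

# Sizing by the minimum KV head count covers every layer.
sized_with_min = group_size(num_heads, sizing_kv_heads(kv_num_heads_per_layer))

covers_all_layers = sized_with_min >= max(per_layer_group_sizes)
```

A capacity test along these lines could assert that the group size used for allocation is at least the largest per-layer group size.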

chang-wenbin added 9 commits June 8, 2026 20:46

support qkdim!=vdim

44f547c

Merge remote-tracking branch 'origin/develop' into qkdim_vdim

fc06ae2

Merge remote-tracking branch 'origin/develop' into qkdim_vdim

19a7044

support gqa qkdim=192 vdim=128

2958fda

support qkdim!=vdim

303ad42

merge develop

29dce63

update v_head_dim

db7d260

fix & update

94fa1f9

diff_kv_head

b0a901e

chang-wenbin had a problem deploying to Metax_ci June 10, 2026 13:13 — with GitHub Actions Failure

update kvhead8-4

d7bda46

chang-wenbin had a problem deploying to Metax_ci June 11, 2026 10:14 — with GitHub Actions Failure

chang-wenbin requested a review from PaddlePaddle-bot June 11, 2026 10:15

update kvhead8-4

ec0be7d

chang-wenbin had a problem deploying to Metax_ci June 11, 2026 10:27 — with GitHub Actions Failure

PaddlePaddle-bot suggested changes Jun 11, 2026

Diff kv head #8037: chang-wenbin wants to merge 11 commits into PaddlePaddle:develop from chang-wenbin:diff_kv_head

3 participants
