[BugFix] Fix attention buffer sizing for per-layer KV heads #8043
Merged
chang-wenbin merged 1 commit into PaddlePaddle:develop from chang-wenbin:fix_bug_for_attn_back_buffer on Jun 14, 2026 (+2 −1)
🔴 Bug
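`min(kv_num_heads_per_layer)` only covers the buffers that grow with `group_size`. The capacity of the decode-unified `decode_block_indices` is also multiplied by `kv_num_heads`, so this change underestimates the capacity for some per-layer KV-head configurations. In `allocate_decode_unified_related_buffer()` and `allocate_launch_related_buffer()` the capacity is `max_batch_size * kv_num_heads * max_num_chunk * q_tile_num`, while `config_for_attention()` actually writes according to `q_tile_num * kv_chunk_num * kv_num_heads`. For example, with `num_heads=64`, `decoder_step_token_num=1`, and per-layer KV heads of `[8, 1]`, sizing by `min=1` reserves only `1 * ceil(64 / 1 / 16) = 4` q/kv-head tiles. If layer 0 uses 8 KV heads, it needs `8 * ceil(64 / 8 / 16) = 8`, which risks an out-of-bounds write or an incorrect configuration.

Suggested fix: keep the minimum KV-head count for buffers such as `decoder_batch_ids`/`encoder_batch_ids`, which scale only with `group_size`, but size `decode_block_indices` separately from the worst case across all layers. For example, compute `max(k * ceil(decoder_step_token_num * (num_heads // k) / 16) for k in kv_num_heads_per_layer)` and use it as the capacity upper bound. Alternatively, extend the allocation API so that the maximum `group_size` and the `decode_block_indices` capacity are passed in separately.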
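The worst-case bound suggested above can be checked with a small standalone sketch. This is not the project's code: the function names `tiles_for_kv_heads` and `worst_case_tiles` are hypothetical, the tile height of 16 is taken from the `/ 16` in the arithmetic above, and the `max_batch_size` and `max_num_chunk` factors are left out because they are the same for every layer.

```python
Q_TILE_ROWS = 16  # assumption: tile height implied by the "/ 16" in the comment


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def tiles_for_kv_heads(num_heads: int, kv_num_heads: int,
                       decoder_step_token_num: int) -> int:
    """q/kv-head tiles one layer needs: kv_num_heads * q_tile_num."""
    group_size = num_heads // kv_num_heads
    q_tile_num = ceil_div(decoder_step_token_num * group_size, Q_TILE_ROWS)
    return kv_num_heads * q_tile_num


def worst_case_tiles(num_heads: int, kv_num_heads_per_layer: list[int],
                     decoder_step_token_num: int) -> int:
    """Upper bound across layers, as proposed for decode_block_indices."""
    return max(
        tiles_for_kv_heads(num_heads, k, decoder_step_token_num)
        for k in kv_num_heads_per_layer
    )


if __name__ == "__main__":
    per_layer = [8, 1]
    # Sizing by min(per_layer) reserves fewer tiles than layer 0 needs.
    print(tiles_for_kv_heads(64, min(per_layer), 1))  # 4
    print(worst_case_tiles(64, per_layer, 1))         # 8
```

With the numbers from the example, sizing by the minimum KV-head count yields 4 tiles while the per-layer maximum yields 8, which matches the gap described in the comment.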