Merged
3 changes: 2 additions & 1 deletion fastdeploy/worker/gpu_model_runner.py
@@ -1749,7 +1749,8 @@ def _initialize_attn_backend(self) -> None:
decoder_block_shape_q=decoder_block_shape_q,
decoder_step_token_num=self.speculative_config.num_speculative_tokens + 1,
num_heads=num_heads,
-kv_num_heads=max(kv_num_heads_per_layer),
+# This requires the largest possible group size, corresponding to the smallest kv-num-heads.
+kv_num_heads=min(kv_num_heads_per_layer),


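🔴 Bug: min(kv_num_heads_per_layer) only covers the buffers that grow with group_size. The capacity of decode_block_indices in the decode-unified path is additionally multiplied by kv_num_heads, so this change underestimates the required capacity for some per-layer KV-head configurations.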

In allocate_decode_unified_related_buffer() and allocate_launch_related_buffer(), the capacity is max_batch_size * kv_num_heads * max_num_chunk * q_tile_num, whereas config_for_attention() actually writes q_tile_num * kv_chunk_num * kv_num_heads entries. For example, with num_heads=64, decoder_step_token_num=1, and per-layer KV heads of [8, 1], using min=1 reserves only 1 * ceil(64 / 1 / 16) = 4 q/kv-head tiles. If layer 0 uses 8 KV heads, it needs 8 * ceil(64 / 8 / 16) = 8 tiles, so there is a risk of out-of-bounds writes or misconfiguration.

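Suggested fix:
Keep the minimum KV-head count for buffers such as decoder_batch_ids/encoder_batch_ids, which scale only with group_size, but size decode_block_indices separately from the worst case across all layers. For example, compute max(k * ceil(decoder_step_token_num * (num_heads // k) / 16) for k in kv_num_heads_per_layer) and use it as the capacity upper bound. Alternatively, extend the allocation API so that the maximum group_size and the decode_block_indices capacity are passed in separately.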
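
A minimal sketch of the worst-case computation described above, for illustration only. The helper names `tiles_for_layer` and `worst_case_tiles` and the constant `Q_TILE` are hypothetical and are not part of FastDeploy. The tile size of 16 is taken from the `/ 16` in the formula above, and the sketch assumes num_heads is divisible by each per-layer KV-head count.

```python
# Hypothetical helpers; not FastDeploy APIs.
Q_TILE = 16  # assumed query tile size, taken from the "/ 16" in the review formula


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division without going through floats."""
    return -(-a // b)


def tiles_for_layer(num_heads: int, kv_heads: int, decoder_step_token_num: int) -> int:
    """q/kv-head tiles one layer needs: kv_heads * ceil(step_tokens * group_size / Q_TILE)."""
    group_size = num_heads // kv_heads  # assumes num_heads % kv_heads == 0
    return kv_heads * ceil_div(decoder_step_token_num * group_size, Q_TILE)


def worst_case_tiles(num_heads: int, kv_num_heads_per_layer: list[int],
                     decoder_step_token_num: int) -> int:
    """Upper bound over all layers, usable as a capacity factor for decode_block_indices."""
    return max(
        tiles_for_layer(num_heads, k, decoder_step_token_num)
        for k in kv_num_heads_per_layer
    )


if __name__ == "__main__":
    # Example from the comment: num_heads=64, decoder_step_token_num=1, KV heads [8, 1].
    print(tiles_for_layer(64, 1, 1))        # sized with min(kv heads) = 1
    print(tiles_for_layer(64, 8, 1))        # what the layer with 8 KV heads needs
    print(worst_case_tiles(64, [8, 1], 1))  # bound covering every layer
```

With the example values, sizing by the minimum KV-head count gives 4 tiles while the layer with 8 KV heads needs 8, which is the gap the comment points out. Taking the maximum over layers covers both.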

block_size=self.fd_config.cache_config.block_size,
head_dim=head_dim,
dtype=self.model_config.dtype,