[BugFix] fix consume_signals barrier deadlock in PD separation#8034
[BugFix] fix consume_signals barrier deadlock in PD separation#8034kevincheng2 wants to merge 1 commit into
Conversation
f6f19e0 to
efdaa94
Compare
In PD separation mode, different ranks may receive cache_info at different times. When consume_signals gets a layer0 signal, some ranks find the engine_idx already in idx_cache_task_dict (ready) while others don't (pending). This causes different ranks to put different batch_engine_signals into the queue, leading to mismatched finish_send_cache_barrier.wait() calls and deadlock. Fix: route all layer0 signals through pending_layer0_signals uniformly, then immediately recover any that already have cache_info registered. Each recovered signal is put into the queue individually (single-request batch) to ensure all ranks have identical batch granularity regardless of recovery timing.
efdaa94 to
87a8601
Compare
PaddlePaddle-bot
left a comment
There was a problem hiding this comment.
🤖 Paddle-CI-Agent | pr_review |
2026-06-10 17:47:48
📋 Review 摘要
PR 概述:修复 PD separation 中 layer0 signal 因 cache_info 异步注册导致的跨 rank 入队粒度不一致问题。
变更范围:fastdeploy/cache_manager/cache_messager.py
影响面 Tag:KVCache PD Disaggregation
问题
未发现阻塞性问题。PR 规范问题在下面章节报,不在这里重复。
历史 Findings 修复情况
| Finding | 问题 | 状态 |
|---|---|---|
| F1 | 原实现会把当前已注册 cache_info 的 pending layer0 signal 聚合成一个 batch 入队,仍依赖本地注册状态。 | ✅ 已修复 |
📝 PR 规范检查
标题符合 [BugFix] Tag 规范;PR 描述缺少当前仓库要求的 ## Motivation、## Modifications、## Usage or Command、## Accuracy Tests、## Checklist 完整结构。
PR 描述建议(点击展开,可直接复制)
## Motivation
Fix a deadlock in PD separation where layer0 KV signals can be batched differently across ranks when `cache_info` is registered asynchronously.
## Modifications
- Route layer0 signals through `pending_layer0_signals` in `fastdeploy/cache_manager/cache_messager.py`.
- Recover layer0 signals after the corresponding cache task information is available.
## Usage or Command
N/A
## Accuracy Tests
N/A. This change affects cache transfer synchronization and does not change model accuracy.
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.总体评价
本轮重点检查了 layer0 signal 的两个恢复入口和发送线程的 barrier 调用路径。当前 diff 已将 consume_signals 中的就绪 batch 分支改为单个 signal 恢复入队,_add_cache_task_thread 中 cache_info 先到的恢复路径也保持单个 signal batch,未发现新的阻塞性代码问题。
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #8034 +/- ##
==========================================
Coverage ? 67.63%
==========================================
Files ? 470
Lines ? 66108
Branches ? 10186
==========================================
Hits ? 44710
Misses ? 18551
Partials ? 2847
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Problem
In PD separation mode,
consume_signalsthread splits layer0 signals intoready_engine_signalsandpending_engine_signalsbased on whetherengine_idxexists inidx_cache_task_dict. Since_add_cache_task_threadfillsidx_cache_task_dictasynchronously, different ranks may see different states at the same moment:idx_cache_task_dict→ goes toready_engine_signals→ put into queue as batch[(A, xx), (B, xx)]idx_cache_task_dictyet → goes topending_engine_signals→ later recovered individually as[(A, xx)]This causes
batch_engine_signalsto differ across ranks, leading to mismatchedfinish_send_cache_barrier.wait()calls and deadlock.Fix
Route all layer0 signals through
pending_layer0_signalsuniformly. The_add_cache_task_threadrecovers them one-by-one aftercache_infoarrives, ensuring all ranks produce identical per-request signals into the queue.Testing
idx_cache_task_dictcheck branch inconsume_signals, unifying all layer0 signals to the pending path