Skip to content

[BugFix] fix consume_signals barrier deadlock in PD separation#8034

Open
kevincheng2 wants to merge 1 commit into
PaddlePaddle:developfrom
kevincheng2:fix/consume-signals-barrier-deadlock
Open

[BugFix] fix consume_signals barrier deadlock in PD separation#8034
kevincheng2 wants to merge 1 commit into
PaddlePaddle:developfrom
kevincheng2:fix/consume-signals-barrier-deadlock

Conversation

@kevincheng2

Copy link
Copy Markdown
Collaborator

Problem

In PD separation mode, consume_signals thread splits layer0 signals into ready_engine_signals and pending_engine_signals based on whether engine_idx exists in idx_cache_task_dict. Since _add_cache_task_thread fills idx_cache_task_dict asynchronously, different ranks may see different states at the same moment:

  • Rank 0: engine_idx A is in idx_cache_task_dict → goes to ready_engine_signals → put into queue as batch [(A, xx), (B, xx)]
  • Rank 1: engine_idx A is NOT in idx_cache_task_dict yet → goes to pending_engine_signals → later recovered individually as [(A, xx)]

This causes batch_engine_signals to differ across ranks, leading to mismatched finish_send_cache_barrier.wait() calls and deadlock.

Fix

Route all layer0 signals through pending_layer0_signals uniformly. The _add_cache_task_thread recovers them one-by-one after cache_info arrives, ensuring all ranks produce identical per-request signals into the queue.

Testing

  • Verified logic correctness by code review
  • The fix is minimal: removes the idx_cache_task_dict check branch in consume_signals, unifying all layer0 signals to the pending path

PaddlePaddle-bot

This comment was marked as outdated.

In PD separation mode, different ranks may receive cache_info at different
times. When consume_signals gets a layer0 signal, some ranks find the
engine_idx already in idx_cache_task_dict (ready) while others don't (pending).
This causes different ranks to put different batch_engine_signals into the
queue, leading to mismatched finish_send_cache_barrier.wait() calls and
deadlock.

Fix: route all layer0 signals through pending_layer0_signals uniformly,
then immediately recover any that already have cache_info registered.
Each recovered signal is put into the queue individually (single-request
batch) to ensure all ranks have identical batch granularity regardless
of recovery timing.
@kevincheng2 kevincheng2 force-pushed the fix/consume-signals-barrier-deadlock branch from efdaa94 to 87a8601 Compare June 10, 2026 09:16

@PaddlePaddle-bot PaddlePaddle-bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Paddle-CI-Agent | pr_review | 2026-06-10 17:47:48

📋 Review 摘要

PR 概述:修复 PD separation 中 layer0 signal 因 cache_info 异步注册导致的跨 rank 入队粒度不一致问题。
变更范围fastdeploy/cache_manager/cache_messager.py
影响面 TagKVCache PD Disaggregation

问题

未发现阻塞性问题。PR 规范问题在下面章节报,不在这里重复。

历史 Findings 修复情况

Finding 问题 状态
F1 原实现会把当前已注册 cache_info 的 pending layer0 signal 聚合成一个 batch 入队,仍依赖本地注册状态。 ✅ 已修复

📝 PR 规范检查

标题符合 [BugFix] Tag 规范;PR 描述缺少当前仓库要求的 ## Motivation## Modifications## Usage or Command## Accuracy Tests## Checklist 完整结构。

PR 描述建议(点击展开,可直接复制)
## Motivation
Fix a deadlock in PD separation where layer0 KV signals can be batched differently across ranks when `cache_info` is registered asynchronously.

## Modifications
- Route layer0 signals through `pending_layer0_signals` in `fastdeploy/cache_manager/cache_messager.py`.
- Recover layer0 signals after the corresponding cache task information is available.

## Usage or Command
N/A

## Accuracy Tests
N/A. This change affects cache transfer synchronization and does not change model accuracy.

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

总体评价

本轮重点检查了 layer0 signal 的两个恢复入口和发送线程的 barrier 调用路径。当前 diff 已将 consume_signals 中的就绪 batch 分支改为单个 signal 恢复入队,_add_cache_task_thread 中 cache_info 先到的恢复路径也保持单个 signal batch,未发现新的阻塞性代码问题。

@codecov-commenter

codecov-commenter commented Jun 10, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@1eed8ba). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #8034   +/-   ##
==========================================
  Coverage           ?   67.63%           
==========================================
  Files              ?      470           
  Lines              ?    66108           
  Branches           ?    10186           
==========================================
  Hits               ?    44710           
  Misses             ?    18551           
  Partials           ?     2847           
Flag Coverage Δ
GPU 77.72% <100.00%> (?)
XPU 7.01% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants