[BugFix] fix consume_signals barrier deadlock in PD separation by kevincheng2 · Pull Request #8034 · PaddlePaddle/FastDeploy

kevincheng2 · 2026-06-10T07:57:04Z

Problem

In PD separation mode, consume_signals thread splits layer0 signals into ready_engine_signals and pending_engine_signals based on whether engine_idx exists in idx_cache_task_dict. Since _add_cache_task_thread fills idx_cache_task_dict asynchronously, different ranks may see different states at the same moment:

Rank 0: engine_idx A is in idx_cache_task_dict → goes to ready_engine_signals → put into queue as batch [(A, xx), (B, xx)]
Rank 1: engine_idx A is NOT in idx_cache_task_dict yet → goes to pending_engine_signals → later recovered individually as [(A, xx)]

This causes batch_engine_signals to differ across ranks, leading to mismatched finish_send_cache_barrier.wait() calls and deadlock.

Fix

Route all layer0 signals through pending_layer0_signals uniformly. The _add_cache_task_thread recovers them one-by-one after cache_info arrives, ensuring all ranks produce identical per-request signals into the queue.

Testing

Verified logic correctness by code review
The fix is minimal: removes the idx_cache_task_dict check branch in consume_signals, unifying all layer0 signals to the pending path

In PD separation mode, different ranks may receive cache_info at different times. When consume_signals gets a layer0 signal, some ranks find the engine_idx already in idx_cache_task_dict (ready) while others don't (pending). This causes different ranks to put different batch_engine_signals into the queue, leading to mismatched finish_send_cache_barrier.wait() calls and deadlock. Fix: route all layer0 signals through pending_layer0_signals uniformly, then immediately recover any that already have cache_info registered. Each recovered signal is put into the queue individually (single-request batch) to ensure all ranks have identical batch granularity regardless of recovery timing.

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-06-10 17:47:48

📋 Review 摘要

PR 概述：修复 PD separation 中 layer0 signal 因 cache_info 异步注册导致的跨 rank 入队粒度不一致问题。
变更范围：fastdeploy/cache_manager/cache_messager.py
影响面 Tag：KVCache PD Disaggregation

问题

未发现阻塞性问题。PR 规范问题在下面章节报，不在这里重复。

历史 Findings 修复情况

Finding	问题	状态
F1	原实现会把当前已注册 cache_info 的 pending layer0 signal 聚合成一个 batch 入队，仍依赖本地注册状态。	✅ 已修复

📝 PR 规范检查

标题符合 [BugFix] Tag 规范；PR 描述缺少当前仓库要求的 ## Motivation、## Modifications、## Usage or Command、## Accuracy Tests、## Checklist 完整结构。

PR 描述建议（点击展开，可直接复制）

## Motivation
Fix a deadlock in PD separation where layer0 KV signals can be batched differently across ranks when `cache_info` is registered asynchronously.

## Modifications
- Route layer0 signals through `pending_layer0_signals` in `fastdeploy/cache_manager/cache_messager.py`.
- Recover layer0 signals after the corresponding cache task information is available.

## Usage or Command
N/A

## Accuracy Tests
N/A. This change affects cache transfer synchronization and does not change model accuracy.

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

总体评价

本轮重点检查了 layer0 signal 的两个恢复入口和发送线程的 barrier 调用路径。当前 diff 已将 consume_signals 中的就绪 batch 分支改为单个 signal 恢复入队，_add_cache_task_thread 中 cache_info 先到的恢复路径也保持单个 signal batch，未发现新的阻塞性代码问题。

codecov-commenter · 2026-06-10T09:53:23Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@1eed8ba). Learn more about missing BASE report.

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #8034   +/-   ##
==========================================
  Coverage           ?   67.63%           
==========================================
  Files              ?      470           
  Lines              ?    66108           
  Branches           ?    10186           
==========================================
  Hits               ?    44710           
  Misses             ?    18551           
  Partials           ?     2847

Flag	Coverage Δ
GPU	`77.72% <100.00%> (?)`
XPU	`7.01% <0.00%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

kevincheng2 had a problem deploying to Metax_ci June 10, 2026 07:57 — with GitHub Actions Failure

kevincheng2 mentioned this pull request Jun 10, 2026

[Cherry-Pick][BugFix] fix consume_signals barrier deadlock in PD separation(#8034) #8035

Open

kevincheng2 force-pushed the fix/consume-signals-barrier-deadlock branch from f6f19e0 to efdaa94 Compare June 10, 2026 08:34

kevincheng2 had a problem deploying to Metax_ci June 10, 2026 08:34 — with GitHub Actions Failure

This comment was marked as outdated.

Sign in to view

kevincheng2 force-pushed the fix/consume-signals-barrier-deadlock branch from efdaa94 to 87a8601 Compare June 10, 2026 09:16

kevincheng2 had a problem deploying to Metax_ci June 10, 2026 09:16 — with GitHub Actions Failure

PaddlePaddle-bot reviewed Jun 10, 2026

View reviewed changes

Sunny-bot1 approved these changes Jun 10, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BugFix] fix consume_signals barrier deadlock in PD separation#8034

[BugFix] fix consume_signals barrier deadlock in PD separation#8034
kevincheng2 wants to merge 1 commit into
PaddlePaddle:developfrom
kevincheng2:fix/consume-signals-barrier-deadlock

kevincheng2 commented Jun 10, 2026

Uh oh!

This comment was marked as outdated.

Uh oh!

PaddlePaddle-bot left a comment

Uh oh!

codecov-commenter commented Jun 10, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

kevincheng2 commented Jun 10, 2026

Problem

Fix

Testing

Uh oh!

This comment was marked as outdated.

Uh oh!

PaddlePaddle-bot left a comment

Choose a reason for hiding this comment

📋 Review 摘要

问题

历史 Findings 修复情况

📝 PR 规范检查

总体评价

Uh oh!

codecov-commenter commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov-commenter commented Jun 10, 2026 •

edited

Loading