[XPU] fix_same_req_id #8040

Open
cmcamdy wants to merge 1 commit into PaddlePaddle:develop from cmcamdy:fix_same_req_id

@cmcamdy cmcamdy commented Jun 11, 2026

Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@codecov-commenter

codecov-commenter commented Jun 11, 2026

Codecov Report

❌ Patch coverage is 80.00000% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@fab344e). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/engine/common_engine.py | 50.00% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #8040   +/-   ##
==========================================
  Coverage           ?   67.72%           
==========================================
  Files              ?      471           
  Lines              ?    66361           
  Branches           ?    10217           
==========================================
  Hits               ?    44946           
  Misses             ?    18546           
  Partials           ?     2869           
| Flag | Coverage Δ |
|---|---|
| GPU | 77.79% <80.00%> (?) |
| XPU | 6.99% <0.00%> (?) |

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-11 17:13:21

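📋 Review Summary

PR overview: Adds logic that rejects a duplicate request_id when PD decode preallocates resources, and preserves the error reason returned by the D side.
Change scope: fastdeploy/engine/common_engine.py, fastdeploy/engine/sched/resource_manager_v1.py
Impact tags: [Engine] [Scheduler] [PD Disaggregation]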

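Issues

| Level | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/engine/sched/resource_manager_v1.py:1596 | In cache-task mode, a duplicate request_id is treated as insufficient resources and retried, so P/D will wait forever |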

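📝 PR Convention Check

The title tag is [XPU], but this diff modifies the Engine/Scheduler resource preallocation logic for PD decode and does not touch the XPU-specific worker/model_runner/ops code. The PR description is still the template placeholder and lacks concrete Motivation/Modifications/Usage/Accuracy Tests content. We suggest replacing it with the complete content below.

Suggested title (can be copied as is):

  • [PD Disaggregation] Fix duplicate request id handling in decode preallocation

Suggested PR description (can be copied as is):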
## Motivation
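Fix an issue in the P/D disaggregation scenario where the Decode side, on receiving a duplicate request_id, could reuse or corrupt an existing KV cache.

## Modifications
- `fastdeploy/engine/sched/resource_manager_v1.py`: During Decode-side resource preallocation, check whether `request_id` already exists in `self.requests`; if it is a duplicate, set an error message and refuse the allocation.
- `fastdeploy/engine/common_engine.py`: When a resource preallocation failure is reported back to Prefill, keep the error reason already set by the Decode side instead of always overwriting it with `Not enough resources`.

## Usage or Command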
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

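Overall Assessment

The fix does prevent the D side from reusing existing blocks for the same request_id, but permanent failures and temporary resource shortages currently share the same False return value, which leaves duplicate requests stuck in cache-task mode. Either split the failure semantics first, or, when error_msg is already set, return the error and remove the request from the queue.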
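One way to split the failure semantics is sketched below. This is an illustration only, not the FastDeploy implementation: `PreallocResult`, `preallocate`, `drain_queue`, and the `need_blocks` field are hypothetical names, and only the `error_msg` key and the "Duplicate request id in decode" message come from this report. The point is that a duplicate request_id gets a distinct result that the caller reports back and drops, while a temporary shortage stays queued for retry.

```python
from enum import Enum, auto


class PreallocResult(Enum):
    OK = auto()        # resources reserved
    RETRY = auto()     # temporary shortage, try again later
    REJECTED = auto()  # permanent failure, never retry


def preallocate(requests, free_blocks, task):
    """Hypothetical stand-in for decode-side preallocation."""
    if task["request_id"] in requests:
        # Permanent failure: record the reason so it can be sent back.
        task["error_msg"] = "Duplicate request id in decode"
        return PreallocResult.REJECTED
    if free_blocks < task["need_blocks"]:
        return PreallocResult.RETRY
    requests[task["request_id"]] = task
    return PreallocResult.OK


def drain_queue(queue, requests, free_blocks):
    """Process the queue once; return (still_waiting, rejected)."""
    waiting, rejected = [], []
    for task in queue:
        result = preallocate(requests, free_blocks, task)
        if result is PreallocResult.RETRY:
            waiting.append(task)   # stays queued for the next pass
        elif result is PreallocResult.REJECTED:
            rejected.append(task)  # leaves the queue, error is reported
        else:
            free_blocks -= task["need_blocks"]
    return waiting, rejected
```

With this shape, a rejected task is never put back in the waiting list, so a duplicate cannot loop as if it were a resource shortage.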

Comment thread fastdeploy/engine/sched/resource_manager_v1.py
@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-13 15:17:33 UTC+08:00

The CI report is generated from the following code (refreshed every 30 minutes):
PR commit: cb98744 | Merge base: fab344e (branch: develop)


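1 Required jobs: 9/10 passed

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 41(0) | 41 | 37 | 4 | 0 | 0 | 0 |

| Job | Error type | Confidence | Log |
|---|---|---|---|
| Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | PR issue | | Job |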

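2 Failure details

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)

Error type: PR issue | Confidence: high
Analyzer: generic analysis (fallback)
Failed case: diff coverage check

| Case | Error summary |
|---|---|
| diff-cover | New lines in fastdeploy/engine/common_engine.py are not covered; diff coverage is 66%, below the 80% threshold |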

Key logs:

Failure. Coverage is below 80%.
Diff Coverage
fastdeploy/engine/common_engine.py (0.0%): Missing lines 2137-2138
fastdeploy/engine/sched/resource_manager_v1.py (100%)
TEST_EXIT_CODE: 0
COVERAGE_EXIT_CODE: 9
"total_num_violations": 2, "total_percent_covered": 66
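  • Root cause summary: the new branch in common_engine lacks test coverage
    In fastdeploy/engine/common_engine.py, the PR changes the error write on resource preallocation failure so that the default "Not enough resources" is set only when `if not task.get("error_msg", None)` holds. The CI unit tests themselves pass, but the new guard/fallback branch is not covered by any test, so diff-cover --fail-under=80 returns 9.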

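Suggested fix:

  1. In tests/engine/test_common_engine.py, add cases for the decode splitwise resource preallocation failure path that cover fastdeploy/engine/common_engine.py:2128-2129: construct a scenario where preallocate_resource_in_d() returns False and the task already carries error_msg="Duplicate request id in decode", then assert that the task sent to prefill keeps the original error instead of being overwritten with "Not enough resources". The same tests can also cover the branch that writes the default error when no error_msg is present.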
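The two assertions suggested above could look roughly like the sketch below. It is self-contained and does not exercise the real engine: `mark_prealloc_failure` is a hypothetical helper that only mirrors the guard quoted in the root cause summary, and the task is assumed to be a plain dict. A real test in tests/engine/test_common_engine.py would instead drive the actual failure path with preallocate_resource_in_d() returning False.

```python
DEFAULT_ERROR = "Not enough resources"


def mark_prealloc_failure(task: dict) -> dict:
    """Hypothetical mirror of the guard described in the CI report."""
    # Fill in the default only when the decode side set no reason.
    if not task.get("error_msg", None):
        task["error_msg"] = DEFAULT_ERROR
    return task


def test_keeps_decode_side_error():
    task = {"request_id": "req-1",
            "error_msg": "Duplicate request id in decode"}
    sent = mark_prealloc_failure(task)
    assert sent["error_msg"] == "Duplicate request id in decode"


def test_writes_default_error_when_missing():
    task = {"request_id": "req-2"}
    sent = mark_prealloc_failure(task)
    assert sent["error_msg"] == DEFAULT_ERROR
```

Covering both branches of the guard is what would address the missing lines reported by diff-cover.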

Related changes: fastdeploy/engine/common_engine.py:2128-2129, fastdeploy/engine/sched/resource_manager_v1.py:1590-1596
