[XPU] add as timeout #7967

Open
cmcamdy wants to merge 2 commits into PaddlePaddle:develop from cmcamdy:add_as_timeout

Conversation

cmcamdy (Collaborator) commented Jun 2, 2026
Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

PaddlePaddle-bot commented Jun 2, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-12 18:51:10 UTC+08:00

This CI report was generated from the following code (refreshed every 30 minutes):
PR commit: f3dacd3 | Merge base: b0e2e01 (branch: develop)


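1 Required jobs: 8/10 passed

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 42(0) | 42 | 36 | 6 | 0 | 0 | 0 |

| Job | Error type | Confidence | Log |
|---|---|---|---|
| Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | PR issue: PD IPC Decode-side request timeout | High | Job |
| Approval | Approval required | High | Job |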

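2 Failure details

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue, PD IPC Decode-side request timeout (confidence: high)

Failed cases:

| Case | Error summary |
|---|---|
| e2e/test_ernie_03b_pd_router_v1_ipc.py::test_non_chat_usage_non_stream, and others (3 cases) | send_request returns None; the request is aborted by the Decode side after 60s |

Key logs: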

[decode] 13:29:18 Resource available, processing task cmpl-f9fa588c::n::0
[decode] 13:30:18 Receive abort request, req_id: cmpl-f9fa588c::n::0
[decode] 13:31:19 ConnectionResetError: [Errno 104] Connection reset by peer
           ERROR: Error in main loop of decode_process_splitwise_requests
[prefill] 13:31:19 WARNING: wait for sending cache, chatcmpl-ab736883... (repeated)
[prefill] 13:31:19 ERROR: while get input_data error: Connection reset by peer
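  • Root cause summary: with PD IPC disaggregation, Decode-side requests get no response and time out after 60s

The PR changes wait_prefetch_storage_task (prefix_cache_manager.py:1298-1312): when the AS prefetch times out (45s), it now returns [] (empty block IDs) and continues. When the Decode side keeps scheduling with empty block IDs, it cannot complete the KV Cache transfer, so the request hangs until the 60s client timeout aborts it. In addition, once wait_write_storage_task times out and skips the write-back, the Prefill side may send Decode block IDs that have not been written yet, which eventually crashes the IPC channel (ConnectionResetError/BrokenPipeError).

Suggested fixes:

  1. fastdeploy/cache_manager/prefix_cache_manager.py, line 1309: after a timeout, wait_prefetch_storage_task should return None rather than []; the caller must check for None and reject or retry the request, so that Decode scheduling does not continue with empty block IDs
  2. fastdeploy/cache_manager/prefix_cache_manager.py, line 1275: after wait_write_storage_task times out and skips the write-back, the Prefill side must mark those block IDs as invalid and stop them from being sent to Decode over IPC; otherwise Decode receives invalid block IDs and the IPC channel crashes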
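
The first suggestion above can be illustrated with a minimal sketch. It is not the FastDeploy implementation: `PrefetchWaiter`, `schedule`, and `AS_WAIT_TIMEOUT_S` are hypothetical names, and the only point shown is how returning `None` on timeout lets a caller tell "prefetch timed out" apart from "prefetch finished with no blocks".

```python
import threading
from typing import Dict, List, Optional

# Assumed default, mirroring the 45s timeout mentioned in the report.
AS_WAIT_TIMEOUT_S = 45.0


class PrefetchWaiter:
    """Hypothetical stand-in for the prefetch bookkeeping in a cache manager."""

    def __init__(self) -> None:
        self.events: Dict[str, threading.Event] = {}
        self.block_ids: Dict[str, List[int]] = {}

    def start(self, req_id: str) -> None:
        self.events[req_id] = threading.Event()

    def finish(self, req_id: str, ids: List[int]) -> None:
        # Would be called by a transfer thread once the prefetch completes.
        self.block_ids[req_id] = ids
        self.events[req_id].set()

    def wait(self, req_id: str, timeout: float = AS_WAIT_TIMEOUT_S) -> Optional[List[int]]:
        """Return the block IDs, or None if the task is unknown or timed out."""
        event = self.events.get(req_id)
        if event is None:
            return None
        completed = event.wait(timeout=timeout)
        # Clean up both tables regardless of the outcome.
        self.events.pop(req_id, None)
        ids = self.block_ids.pop(req_id, None)
        if not completed:
            return None  # timeout: distinct from an empty but valid result
        return ids if ids is not None else []


def schedule(waiter: PrefetchWaiter, req_id: str, timeout: float) -> str:
    """Hypothetical caller: refuse to schedule when the prefetch timed out."""
    ids = waiter.wait(req_id, timeout)
    if ids is None:
        return "rejected"
    return f"scheduled with {len(ids)} blocks"
```

With this shape, an empty list still means "nothing to prefetch, proceed", while `None` forces the caller to take an explicit reject-or-retry path.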

Related changes: fastdeploy/cache_manager/prefix_cache_manager.py:1264-1279 (wait_write_storage_task), 1298-1312 (wait_prefetch_storage_task)

🔴 Approval: approval required (confidence: high)

This job requires manual approval; CI continues only after the approval is granted.

Suggested fix: obtain manual approval.

PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-06-02 12:53:33

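📋 Review summary

PR overview: adds a timeout to the AS (AttentionStore) write-back/prefetch wait operations, so that the main scheduling flow does not block indefinitely when the storage service misbehaves.
Change scope: cache_manager, envs
Impact tags: [KVCache] [XPU]

Issues

No blocking issues found.

Status of previous findings

| Finding | Issue | Status |
|---|---|---|
| F1 | After a timeout, storage_prefetch_block_ids may end up with orphan entries | ⚠️ Still present |

Note: when wait_prefetch_storage_task times out, it runs self.storage_prefetch_block_ids.pop(req_id, None), but the background recv_data_transfer_result thread may still recreate the entry afterwards via self.storage_prefetch_block_ids[task_id] = [] (L2206), leaving an orphan entry that is never cleaned up. Suggestion: in recv_data_transfer_result, check whether the event still exists in the dictionary before setting it, and skip the write to storage_prefetch_block_ids if it does not.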
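
A minimal sketch of the guard described in this finding, under stated assumptions: `PrefetchRegistry` and its methods are hypothetical, and the lock is an addition not mentioned in the review, included so that the existence check and the write happen atomically with respect to the waiter's cleanup.

```python
import threading
from typing import Dict, List


class PrefetchRegistry:
    """Hypothetical registry shared by a waiter and a result-receiver thread."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.events: Dict[str, threading.Event] = {}
        self.block_ids: Dict[str, List[int]] = {}

    def register(self, task_id: str) -> None:
        with self._lock:
            self.events[task_id] = threading.Event()

    def abandon(self, task_id: str) -> None:
        """Waiter side: drop all state for a task after a timeout."""
        with self._lock:
            self.events.pop(task_id, None)
            self.block_ids.pop(task_id, None)

    def publish(self, task_id: str, ids: List[int]) -> bool:
        """Receiver side: store the result only if someone is still waiting."""
        with self._lock:
            event = self.events.get(task_id)
            if event is None:
                # The waiter already gave up; writing now would create
                # an entry that nothing ever removes.
                return False
            self.block_ids[task_id] = ids
            event.set()
            return True
```

A late result for an abandoned task is dropped instead of being stored, so the result table cannot accumulate entries for requests that already timed out.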

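📝 PR convention check

The PR description sections (Motivation, Modifications, Usage or Command, Accuracy Tests) are all empty or contain only template placeholders, which does not meet the §D2 requirement. None of the Checklist items are checked.

Suggested title (can be copied as-is):

  • [XPU] Add timeout for AS write-back/prefetch tasks to avoid blocking scheduling
Suggested PR description (can be copied as-is)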
## Motivation

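Add a timeout to the AS (AttentionStore) write-back and prefetch wait operations, so that the main scheduling flow does not block indefinitely when the storage service misbehaves.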

## Modifications

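- In the `wait_write_storage_task` and `wait_prefetch_storage_task` methods of `prefix_cache_manager.py`, replace the unbounded wait with a timed `event.wait(timeout=...)`; on timeout, log an error and skip.
- Add a new environment variable `FD_AS_WAIT_TIMEOUT` (default 45 seconds) in `envs.py` to control the timeout.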

## Usage or Command

Configure the timeout through the environment variable:
```bash
export FD_AS_WAIT_TIMEOUT=45
```

## Accuracy Tests

N/A (this change does not affect model accuracy)

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

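Overall assessment

The approach is sound: using a timeout to avoid blocking indefinitely is a reasonable defensive measure. The implementation is concise and clear, and replacing del with .pop(req_id, None) and .get(req_id, []) makes the timeout path more fault tolerant. The previously reported orphan-entry issue should still be fixed in a follow-up.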
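
As a rough illustration of why `.pop(req_id, None)` and `.get(req_id, [])` tolerate the timeout path better than `del`, here is a small sketch. The helper names are hypothetical, and the way `FD_AS_WAIT_TIMEOUT` is parsed here is an assumption, not the actual `envs.py` code.

```python
import os
from typing import Dict, List


def as_wait_timeout(default: float = 45.0) -> float:
    """Read FD_AS_WAIT_TIMEOUT, falling back to the assumed 45s default."""
    raw = os.getenv("FD_AS_WAIT_TIMEOUT")
    return float(raw) if raw else default


def take_block_ids(table: Dict[str, List[int]], req_id: str) -> List[int]:
    """Fetch and remove a request's block IDs without raising if absent."""
    ids = table.get(req_id, [])  # missing key yields [] instead of KeyError
    table.pop(req_id, None)      # missing key is a no-op, unlike `del`
    return ids


def take_block_ids_strict(table: Dict[str, List[int]], req_id: str) -> List[int]:
    """The `del`-based variant: raises KeyError when the entry never arrived."""
    ids = table[req_id]
    del table[req_id]
    return ids
```

After a timeout the entry may never have been written, so the strict variant would raise `KeyError` in exactly the situation the timeout is meant to survive.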

@codecov-commenter

codecov-commenter commented Jun 2, 2026


Codecov Report

❌ Patch coverage is 66.66667% with 4 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@b0e2e01). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/cache_manager/prefix_cache_manager.py | 66.66% | 2 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7967   +/-   ##
==========================================
  Coverage           ?   67.88%           
==========================================
  Files              ?      467           
  Lines              ?    65197           
  Branches           ?    10010           
==========================================
  Hits               ?    44258           
  Misses             ?    18095           
  Partials           ?     2844           
| Flag | Coverage Δ |
|---|---|
| GPU | 78.17% <66.66%> (?) |
| XPU | 7.08% <0.00%> (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.



3 participants