[XPU] add as timeout #7967
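CI report generated from the following code (refreshed every 30 minutes):
1 Required jobs: 8/10 passed
2 Failure details
🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue, PD IPC Decode-side request timed out (confidence: high)
Failed cases:
Key logs:
PR changes:
Fix suggestion:
Related changes:
🔴 Approval: approval required (confidence: high)
This job requires manual approval; CI resumes only after the approval is completed.
Fix suggestion: obtain manual approval.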
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-02 12:53:33
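📋 Review summary
PR overview: adds a timeout to the write-back/prefetch wait operations of AS (AttentionStore), so that a failing storage service cannot block the main scheduling flow indefinitely.
Scope of changes: cache_manager, envs
Impact tags: [KVCache] [XPU]
Issues
No blocking issues found.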
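Status of previous findings
| Finding | Issue | Status |
|---|---|---|
| F1 | After a timeout, `storage_prefetch_block_ids` may be left with orphan entries |
Note: when `wait_prefetch_storage_task` times out, it runs `self.storage_prefetch_block_ids.pop(req_id, None)`. The background `recv_data_transfer_result` thread may still recreate that entry afterwards via `self.storage_prefetch_block_ids[task_id] = []` (L2206), leaving an orphan entry that is never cleaned up. Suggestion: in `recv_data_transfer_result`, check whether the event is still present in the dictionary before setting it, and skip the write to `storage_prefetch_block_ids` if it is not.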
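The guard suggested for F1 can be sketched as follows. This is a minimal, self-contained illustration and not the FastDeploy implementation: the class `PrefetchTracker`, the `task_events` dictionary, the lock, and the method names `register`, `on_transfer_result`, and `wait` are all hypothetical; only `storage_prefetch_block_ids` is taken from the review. The idea is that the receiver thread writes a result only while the waiter's event is still registered, and both sides touch the dictionaries under one lock.

```python
import threading


class PrefetchTracker:
    """Hypothetical stand-in for the cache manager's prefetch bookkeeping."""

    def __init__(self):
        self.lock = threading.Lock()
        self.task_events = {}                 # task_id -> threading.Event (assumed name)
        self.storage_prefetch_block_ids = {}  # task_id -> list of block ids

    def register(self, task_id):
        """Create the event a waiter will block on."""
        with self.lock:
            self.task_events[task_id] = threading.Event()

    def on_transfer_result(self, task_id, block_ids):
        """Receiver-thread side: record a result only if someone still waits."""
        with self.lock:
            event = self.task_events.get(task_id)
            if event is None:
                # The waiter already timed out and cleaned up, so the late
                # result is dropped instead of recreating an orphan entry.
                return False
            self.storage_prefetch_block_ids.setdefault(task_id, []).extend(block_ids)
            event.set()
            return True

    def wait(self, task_id, timeout):
        """Waiter side: bounded wait, then unconditional cleanup."""
        with self.lock:
            event = self.task_events[task_id]
        finished = event.wait(timeout=timeout)
        with self.lock:
            self.task_events.pop(task_id, None)
            block_ids = self.storage_prefetch_block_ids.pop(task_id, [])
        return finished, block_ids
```

If the result arrives between the timeout and the cleanup, `wait` still pops it, so neither ordering leaves an entry behind.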
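📝 PR convention check
All sections of the PR description (Motivation, Modifications, Usage or Command, Accuracy Tests) are empty or still contain only the template placeholders, which does not meet the §D2 requirements. None of the Checklist items are ticked.
Suggested title (can be copied as-is):
[XPU] Add timeout for AS write-back/prefetch tasks to avoid blocking scheduling
Suggested PR description (click to expand; can be copied as-is)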
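## Motivation
Add a timeout to the write-back and prefetch wait operations of AS (AttentionStore), so that a failing storage service cannot block the main scheduling flow indefinitely.
## Modifications
- In the `wait_write_storage_task` and `wait_prefetch_storage_task` methods of `prefix_cache_manager.py`, replace the unbounded wait with a bounded `event.wait(timeout=...)`; on timeout, log an error and skip the task.
- Add the environment variable `FD_AS_WAIT_TIMEOUT` (default: 45 seconds) to `envs.py` to control the timeout.
## Usage or Command
Configure the timeout through the environment variable: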
```bash
export FD_AS_WAIT_TIMEOUT=45
```
## Accuracy Tests
N/A (this change does not affect model accuracy)
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
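- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
The approach is sound: using a timeout to avoid blocking indefinitely is a reasonable defensive measure. The implementation is concise and clear, and replacing `del` with `.pop(req_id, None)` and `.get(req_id, [])` makes the timeout path more fault-tolerant. The orphan-entry issue carried over from the earlier review should still be fixed in a follow-up.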
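As a rough illustration of the pattern assessed above (a bounded wait followed by tolerant cleanup), the sketch below reads the timeout from `FD_AS_WAIT_TIMEOUT` and uses `.pop(key, default)` so that cleanup cannot raise `KeyError` after a timeout. The function names `get_wait_timeout` and `wait_storage_task`, and the `events`/`results` dictionaries, are assumptions made for this example; the actual methods in `prefix_cache_manager.py` may be structured differently.

```python
import logging
import os
import threading

logger = logging.getLogger("as_wait_sketch")


def get_wait_timeout(default=45.0):
    """Read FD_AS_WAIT_TIMEOUT in seconds, falling back to the default if unset."""
    return float(os.getenv("FD_AS_WAIT_TIMEOUT", str(default)))


def wait_storage_task(events, results, req_id, timeout=None):
    """Wait for a storage task with a bound, then clean up its bookkeeping.

    events:  dict mapping req_id -> threading.Event (hypothetical layout)
    results: dict mapping req_id -> list of block ids (hypothetical layout)
    """
    if timeout is None:
        timeout = get_wait_timeout()
    event = events.get(req_id)
    if event is None:
        return []  # nothing was registered for this request
    if not event.wait(timeout=timeout):
        # Event.wait returns False on timeout: log and carry on instead of blocking.
        logger.error("storage task %s timed out after %.1fs, skipping", req_id, timeout)
    # pop with a default tolerates keys that are missing, unlike `del`.
    events.pop(req_id, None)
    return list(results.pop(req_id, []))
```

With `del events[req_id]`, a second cleanup of the same request would raise `KeyError`; `pop` with a default makes the cleanup safe to repeat, which matters once a timeout path and a completion path can both reach it.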
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7967 +/- ##
==========================================
Coverage ? 67.88%
==========================================
Files ? 467
Lines ? 65197
Branches ? 10010
==========================================
Hits ? 44258
Misses ? 18095
Partials ? 2844
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
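- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.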