Revert blockwise CUDAGraph and support piecewise CUDAGraph in prefill #7969
ZhangX-21 wants to merge 2 commits into
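CI report generated from the following code (refreshed every 30 minutes):
1 Required jobs: 7/10 passed
2 Failure details
🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)
Failed cases:
Key logs:
PR in
Suggested fix:
Related changes:
🔴 Approval: approval required (confidence: high)
This job requires manual approval; CI continues only after the approval is granted.
🔴 Run Four Cards Tests / run_4_cards_tests: PR issue (confidence: medium)
Failed cases:
Key logs:
PR adds
Suggested fix:
Related changes: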
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7969 +/- ##
==========================================
Coverage ? 67.45%
==========================================
Files ? 466
Lines ? 65196
Branches ? 10015
==========================================
Hits ? 43976
Misses ? 18382
Partials ? 2838
29f8a1e to 342bf2d
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-06-02 16:26:10
📋 Review summary
PR overview: reverts the blockwise CUDAGraph implementation and instead supports piecewise CUDAGraph in the prefill phase.
Scope of changes: model_executor/graph_optimization/, worker/, model_executor/layers/, config.py
Impact tags: [Graph Optimization] [Models] [FDConfig]
Issues
No blocking issues found.
Status of previous findings
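| Finding | Issue | Status |
|---|---|---|
| F1 | Both sides of the isinstance branch in glm4_moe.py are identical | |
| F2 | Assertion code in append_attn_backend.py | ✅ Fixed (the assertion was deleted outright rather than left commented out) |
| F3 | in_prefill_cudagraph_mode has no consumer | |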
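To make finding F1 concrete, the sketch below shows the general pattern of an `isinstance` check whose two arms are identical, and the equivalent code once the check is dropped. The class and function names (`DenseMLP`, `MoEBlock`, `run_before`, `run_after`) are hypothetical and are not taken from glm4_moe.py; this only illustrates the kind of dead branch the finding describes.

```python
class DenseMLP:
    """Hypothetical layer; stands in for whatever type the real check inspects."""

    def forward(self, x):
        return [v * 2 for v in x]


class MoEBlock(DenseMLP):
    """Hypothetical subclass used only to exercise the isinstance check."""


def run_before(mlp, x):
    # Both arms execute the same statement, so the type check has no effect.
    if isinstance(mlp, MoEBlock):
        out = mlp.forward(x)
    else:
        out = mlp.forward(x)
    return out


def run_after(mlp, x):
    # Same behaviour with the redundant branch removed.
    return mlp.forward(x)
```

If the two arms were meant to differ, the fix would instead be to restore the intended difference; removing the branch is only right when identical behaviour is the intent.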
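📝 PR convention check
The title lacks an official tag. The description structure is complete, but Usage or Command and Accuracy Tests contain only placeholder comments.
Suggested title (can be copied as-is):
[Graph Optimization] Revert blockwise CUDAGraph and support piecewise CUDAGraph in prefill
Suggested PR description (click to expand, can be copied as-is)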
## Motivation
This PR reverts the previous blockwise CUDAGraph implementation and adds support for piecewise CUDAGraph in the prefill phase. The blockwise approach captured per-layer graphs which fragmented SOT-compiled graphs; the piecewise approach captures reusable sub-graph segments during prefill without graph fragmentation.
## Modifications
- Revert blockwise CUDAGraph related logic (remove `cuda_graph_op.py`, env vars `FD_USE_BLOCK_WISE_CUDA_GRAPH` / `FD_BLOCK_WISE_CUDA_GRAPH_SIZES`).
- Add `prefill_cudagraph_guard` to skip block-wise wrappers during prefill capture.
- Extend prefill capture sizes up to 8192 tokens in `config.py`.
- Refactor `RMSNorm.forward`: remove dtype cast, use dynamic `max_chunk_tokens` for allreduce fusion.
- Pass `max_token_num` to `flashinfer_allreduce_residual_rmsnorm` to match workspace allocation.
- Keep decode CUDAGraph behavior unchanged.
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
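- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
Overall assessment
The revert is clean and complete: the blockwise CUDAGraph code, environment variables, and tests have all been removed. The piecewise CUDAGraph additions are minimal (only the new prefill_cudagraph_guard and the extended capture sizes). The has_residual refactor in normalization.py is SOT/CUDAGraph-friendly and logically correct. It is recommended to fix the outstanding findings F1 (dead branch in glm4_moe.py) and F3 (in_prefill_cudagraph_mode currently has no consumer; either add a use or a comment explaining why).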
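As a rough illustration of what a guard like `prefill_cudagraph_guard` could look like, here is a minimal sketch of a context manager that sets a flag for the duration of a prefill capture so that optional wrappers can be skipped. Everything except the name `prefill_cudagraph_guard` is an assumption: the module-level flag `_IN_PREFILL_CAPTURE` and the helper `maybe_wrap` are hypothetical, and the actual implementation in the PR may work quite differently.

```python
from contextlib import contextmanager

# Hypothetical module-level flag; the real PR may track this state differently.
_IN_PREFILL_CAPTURE = False


@contextmanager
def prefill_cudagraph_guard():
    """Mark the enclosed region as a prefill graph capture (illustrative only)."""
    global _IN_PREFILL_CAPTURE
    previous = _IN_PREFILL_CAPTURE
    _IN_PREFILL_CAPTURE = True
    try:
        yield
    finally:
        # Restore the previous value so nested or failing captures do not leak state.
        _IN_PREFILL_CAPTURE = previous


def maybe_wrap(layer_fn, wrapper):
    """Hypothetical helper: apply `wrapper` unless a prefill capture is in progress."""

    def call(*args, **kwargs):
        if _IN_PREFILL_CAPTURE:
            # Skip the wrapper while the guard is active.
            return layer_fn(*args, **kwargs)
        return wrapper(layer_fn)(*args, **kwargs)

    return call
```

Restoring the previous flag value in `finally`, rather than resetting it to `False`, keeps the guard safe to nest and ensures an exception during capture does not leave the flag set.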
Motivation
This PR reverts the previous blockwise CUDAGraph implementation and adds support for piecewise CUDAGraph in the prefill phase.
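One common way to use a fixed set of capture sizes is to round each batch's token count up to the nearest captured size and fall back to eager execution when the batch is larger than any captured graph. The sketch below shows that selection step only. It is an assumption that prefill capture here follows this pattern; the size list `PREFILL_CAPTURE_SIZES` and the function `pick_capture_size` are hypothetical, and only the 8192-token upper bound comes from the PR text.

```python
from bisect import bisect_left

# Hypothetical capture sizes; only the 8192 upper bound comes from the PR text.
PREFILL_CAPTURE_SIZES = [256, 512, 1024, 2048, 4096, 8192]


def pick_capture_size(num_tokens, sizes=PREFILL_CAPTURE_SIZES):
    """Return the smallest captured size >= num_tokens, or None if none fits.

    `sizes` must be sorted in ascending order. A return value of None means
    the caller should run without a captured graph.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    idx = bisect_left(sizes, num_tokens)
    return sizes[idx] if idx < len(sizes) else None
```

For example, `pick_capture_size(300)` selects the 512 bucket under these assumed sizes, while a 9000-token batch returns `None`.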
Modifications
Usage or Command
Accuracy Tests
Checklist
- Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- Format your code, run `pre-commit` before commit.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.