Revert blockwise CUDAGraph and support piecewise CUDAGraph in prefill by ZhangX-21 · Pull Request #7969 · PaddlePaddle/FastDeploy

ZhangX-21 · 2026-06-02T05:22:53Z

Motivation

This PR reverts the previous blockwise CUDAGraph implementation and adds support for piecewise CUDAGraph in the prefill phase.

Modifications

Revert blockwise CUDAGraph related logic.
Support piecewise CUDAGraph for prefill.
Capture reusable graph segments inside the prefill phase.
Refactor prefill CUDAGraph capture/replay control flow.
Keep decode CUDAGraph behavior unchanged.

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

PaddlePaddle-bot · 2026-06-02T05:37:21Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-08 13:05:54 UTC+08:00

The CI report is generated from the following code (refreshed every 30 minutes):
PR commit: 342bf2d | Merge base: 4474188 (branch: develop)

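1 Required jobs: 7/10 passed

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 42(0) | 42 | 35 | 7 | 0 | 0 | 0 |

| Job | Error type | Confidence | Log |
|---|---|---|---|
| `Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage` | PR issue | High | Job |
| `Approval` | Approval required | High | Job |
| `Run Four Cards Tests / run_4_cards_tests` | PR issue | Medium | Job |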

2 Failure details

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)

Failed cases:

| Case | Error summary |
|---|---|
| `tests/model_executor/test_ep.py::test_eprunner_moe_select_noaux_tc_without_redundant` | `TypeError` in `scores + e_score_correction_bias`, where `e_score_correction_bias` is None |

Key log:

>       scores_with_bias = scores + e_score_correction_bias
E       TypeError: (InvalidType) __add__(): argument (position 1) must be int, float, bool or Tensor, but got NoneType

fastdeploy/model_executor/layers/moe/moe.py:118: TypeError

Root cause summary: the PR removed `assert e_score_correction_bias is not None` but did not handle the None case in the downstream logic.

The PR removed the assertion `assert e_score_correction_bias is not None` at moe.py:106, so None can now reach the code that follows. When `expert_id_to_ep_rank_array is None and not use_fused_cast`, the code executes `scores + e_score_correction_bias` directly. The test case passes `gate_correction_bias=None`, which triggers the TypeError.

Suggested fixes:

Near line 118 of fastdeploy/model_executor/layers/moe/moe.py, add a conditional for a None `e_score_correction_bias`: `scores_with_bias = scores + e_score_correction_bias if e_score_correction_bias is not None else scores`
Alternatively, make sure `e_score_correction_bias` is not None at the `moe_select` call site in ep.py
Update the test case accordingly to verify correct behavior when `gate_correction_bias=None`
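
The first suggestion can be sketched in isolation. The helper below is a minimal illustration, not the code in moe.py: `apply_correction_bias` is a hypothetical name, and plain Python lists stand in for the Paddle tensors that the real function operates on.

```python
from typing import List, Optional, Sequence


def apply_correction_bias(
    scores: Sequence[float], bias: Optional[Sequence[float]]
) -> List[float]:
    """Add a per-expert bias to the scores, or pass them through when bias is None.

    Hypothetical stand-in for the `scores + e_score_correction_bias` step;
    the real code works on tensors rather than lists.
    """
    if bias is None:
        # No correction bias configured: keep the raw scores.
        return list(scores)
    if len(bias) != len(scores):
        raise ValueError("scores and bias must have the same length")
    return [s + b for s, b in zip(scores, bias)]


# With a bias the scores are shifted; without one they are returned unchanged.
biased = apply_correction_bias([1.0, 2.0], [0.5, 0.5])
unbiased = apply_correction_bias([1.0, 2.0], None)
```

Whether falling back to the raw scores is the intended semantics when no bias is supplied is a design decision for the PR author; the alternative in the second suggestion keeps the invariant at the call site instead.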

Related changes:

fastdeploy/model_executor/layers/moe/moe.py:106: removes `assert e_score_correction_bias is not None` (directly triggers the failure)
fastdeploy/model_executor/layers/moe/ep.py: moves the `get_moe_scores` import to module level

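🔴 Approval: approval required (confidence: high)

This job requires manual approval; CI continues only after the approval is granted.

🔴 Run Four Cards Tests / run_4_cards_tests: PR issue (confidence: medium)

Failed cases:

| Case | Error summary |
|---|---|
| `test_GLM_45_AIR_mtp_tp4.py::test_r3_accuracy` | `EOFError: Ran out of input`; paddle.io.load read an empty or truncated pickle file |

Key log: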

tests/e2e/4cards_cases/test_GLM_45_AIR_mtp_tp4.py:208: in test_r3_accuracy
tests/e2e/utils/rollout_routing_replay_test_utils.py:185: in check_routing_replay_chat_completion
paddle/framework/io.py:1275: in load
paddle/framework/restricted_unpickler.py:227: in safe_load_pickle
E   EOFError: Ran out of input
FAILED tests/e2e/4cards_cases/test_GLM_45_AIR_mtp_tp4.py::test_r3_accuracy
==================== 1 failed, 1 passed in 77.84s (0:01:17) ====================

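Root cause summary: the routing replay pickle file is written empty during the prefill CUDAGraph capture phase.

The PR adds the `@prefill_cudagraph_guard(True)` decorator to `capture_model_prefill_and_mixed` and introduces a global `in_prefill_cudagraph_mode` guard. `test_r3_accuracy` validates inference accuracy through the routing replay mechanism, which depends on serializing routing data to a pickle file. If the routing replay save logic conditionally skips the file write while `in_prefill_cudagraph_mode` is active, or if the piecewise CUDAGraph refactor of the prefill path leaves routing data unpersisted, the pickle file ends up empty and loading it raises `EOFError: Ran out of input`.

Suggested fixes:

Check whether rollout_routing_replay_test_utils.py or the routing replay manager branches on `in_prefill_cudagraph_mode`, and make sure routing data is still written correctly during the prefill CUDAGraph capture phase
Investigate whether piecewise CUDAGraph capture in `capture_model_prefill_and_mixed` conflicts with the routing replay save mechanism (for example, whether CPU-side file I/O is blocked during CUDAGraph capture)
Reproduce locally: run the 4-card GLM-4.5-AIR-MTP-TP4 inference service, then check whether the generated routing replay pickle file is empty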
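
For the local reproduction step, a small check like the following could tell an empty file apart from a truncated or valid one before the test tries to load it. This is a sketch under stated assumptions: `check_pickle_file` is a hypothetical helper, and it uses the standard-library `pickle` module rather than `paddle.io.load` and its restricted unpickler, so it should only be pointed at locally generated test artifacts.

```python
import os
import pickle
import tempfile


def check_pickle_file(path):
    """Classify a pickle file as 'missing', 'empty', 'truncated', or 'ok'."""
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) == 0:
        # A zero-byte file is what makes pickle raise "Ran out of input".
        return "empty"
    try:
        with open(path, "rb") as f:
            pickle.load(f)
    except (EOFError, pickle.UnpicklingError):
        return "truncated"
    return "ok"


# Demonstration on throwaway files; the dictionary content is made up.
with tempfile.TemporaryDirectory() as tmp:
    empty_path = os.path.join(tmp, "empty.pkl")
    open(empty_path, "wb").close()

    good_path = os.path.join(tmp, "good.pkl")
    with open(good_path, "wb") as f:
        pickle.dump({"layer_0": [1, 3]}, f)

    empty_status = check_pickle_file(empty_path)
    good_status = check_pickle_file(good_path)
```

An "empty" result would point at the writer never flushing data, while "truncated" would suggest the write was interrupted partway through.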

Related changes:

fastdeploy/worker/gpu_model_runner.py: adds `@prefill_cudagraph_guard(True)` to `capture_model_prefill_and_mixed`
fastdeploy/model_executor/graph_optimization/utils.py: adds `prefill_cudagraph_guard` and `in_prefill_cudagraph_mode`
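
The implementation of these two names is not shown on this page. As an illustration of the general pattern only, a parameterized decorator that sets a module-level flag for the duration of a call might look like the sketch below. Everything here is an assumption: the flag variable, the choice to expose `in_prefill_cudagraph_mode` as a function, and the `capture_stub` function are all hypothetical.

```python
import functools

# Hypothetical module-level flag; the real storage in utils.py is unknown.
_in_prefill_cudagraph_mode = False


def in_prefill_cudagraph_mode() -> bool:
    """Report whether a guarded prefill capture is currently running."""
    return _in_prefill_cudagraph_mode


def prefill_cudagraph_guard(enabled: bool):
    """Decorator factory: set the flag while the wrapped function runs."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            global _in_prefill_cudagraph_mode
            previous = _in_prefill_cudagraph_mode
            _in_prefill_cudagraph_mode = enabled
            try:
                return fn(*args, **kwargs)
            finally:
                # Restore the previous value even if capture raises.
                _in_prefill_cudagraph_mode = previous

        return wrapper

    return decorator


@prefill_cudagraph_guard(True)
def capture_stub():
    # Stand-in for a capture routine; it only reports the flag it observes.
    return in_prefill_cudagraph_mode()


seen_inside = capture_stub()
seen_after = in_prefill_cudagraph_mode()
```

In a pattern like this, any code that branches on the flag (for example, a save routine that skips file writes) behaves differently inside the guarded call, which is the kind of interaction the root cause analysis above asks to rule out.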

codecov-commenter · 2026-06-02T06:09:25Z

Codecov Report

❌ Patch coverage is 62.50000% with 6 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@4474188). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/model_executor/models/glm4_moe.py | 0.00% | 3 Missing ⚠️ |
| fastdeploy/model_executor/layers/normalization.py | 71.42% | 1 Missing and 1 partial ⚠️ |
| ...tdeploy/model_executor/graph_optimization/utils.py | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #7969   +/-   ##
==========================================
  Coverage           ?   67.45%           
==========================================
  Files              ?      466           
  Lines              ?    65196           
  Branches           ?    10015           
==========================================
  Hits               ?    43976           
  Misses             ?    18382           
  Partials           ?     2838

| Flag | Coverage Δ |
|---|---|
| GPU | `77.68% <66.66%> (?)` |
| XPU | `7.10% <0.00%> (?)` |
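
The headline numbers in this report can be cross-checked with a few lines of arithmetic. The sketch below uses only figures quoted in the report; the patch total of 16 lines is an inference from 62.50% coverage with 6 uncovered lines, not a number Codecov states.

```python
# Project totals copied from the coverage diff.
hits, misses, partials = 43976, 18382, 2838
lines = hits + misses + partials
project_coverage = round(100 * hits / lines, 2)

# Uncovered patch lines per file: glm4_moe.py, normalization.py, utils.py.
# Partial lines count as uncovered in the "6 lines" headline.
uncovered = 3 + (1 + 1) + 1

# Inferred, not reported: how many changed lines were measured in total.
patch_coverage = 62.5
patch_lines = round(uncovered / (1 - patch_coverage / 100))

print(lines, project_coverage, uncovered, patch_lines)
```

The sum of hits, misses, and partials matches the reported 65196 lines, and the hit ratio rounds to the reported 67.45%.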


PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-06-02 16:26:10

📋 Review summary

PR overview: reverts the blockwise CUDAGraph implementation and supports piecewise CUDAGraph in the prefill phase instead
Scope of changes: model_executor/graph_optimization/, worker/, model_executor/layers/, config.py
Affected tags: [Graph Optimization] [Models] [FDConfig]

Issues

No blocking issues found.

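Status of previous findings

| Finding | Issue | Status |
|---|---|---|
| F1 | Both sides of the isinstance branch in `glm4_moe.py` contain identical code | ⚠️ Still present |
| F2 | Assertion code in `append_attn_backend.py` | ✅ Fixed (the assertion was deleted outright rather than left commented out) |
| F3 | `in_prefill_cudagraph_mode` has no consumer | ⚠️ Still present |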

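📝 PR convention check

The title lacks an official tag. The description structure is complete, but Usage or Command and Accuracy Tests contain only placeholder comments.

Suggested title (can be copied directly):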

[Graph Optimization] Revert blockwise CUDAGraph and support piecewise CUDAGraph in prefill

Suggested PR description (can be copied directly)

## Motivation

This PR reverts the previous blockwise CUDAGraph implementation and adds support for piecewise CUDAGraph in the prefill phase. The blockwise approach captured per-layer graphs which fragmented SOT-compiled graphs; the piecewise approach captures reusable sub-graph segments during prefill without graph fragmentation.

## Modifications

- Revert blockwise CUDAGraph related logic (remove `cuda_graph_op.py`, env vars `FD_USE_BLOCK_WISE_CUDA_GRAPH` / `FD_BLOCK_WISE_CUDA_GRAPH_SIZES`).
- Add `prefill_cudagraph_guard` to skip block-wise wrappers during prefill capture.
- Extend prefill capture sizes up to 8192 tokens in `config.py`.
- Refactor `RMSNorm.forward`: remove dtype cast, use dynamic `max_chunk_tokens` for allreduce fusion.
- Pass `max_token_num` to `flashinfer_allreduce_residual_rmsnorm` to match workspace allocation.
- Keep decode CUDAGraph behavior unchanged.

## Usage or Command

N/A

## Accuracy Tests

N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

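Overall assessment

The revert is clean and complete: the blockwise CUDAGraph code, environment variables, and tests have all been removed. The additions for piecewise CUDAGraph are minimal (only adding prefill_cudagraph_guard and extending the capture sizes). The has_residual refactor in normalization.py is SOT/CUDAGraph-friendly and logically correct. It is recommended to fix the outstanding findings F1 (dead branch in glm4_moe.py) and F3 (in_prefill_cudagraph_mode currently has no consumer, so either add a use or explain it in a comment).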

ZhangX-21 had a problem deploying to Metax_ci June 2, 2026 05:22 — with GitHub Actions Failure


Revert blockwise CUDAGraph and support piecewise CUDAGraph in prefill

342bf2d

ZhangX-21 force-pushed the piecewise_cudagraph branch from 29f8a1e to 342bf2d Compare June 2, 2026 08:17

ZhangX-21 had a problem deploying to Metax_ci June 2, 2026 08:17 — with GitHub Actions Failure

PaddlePaddle-bot reviewed Jun 2, 2026


Support piecewise CUDAGraph for MTP execution

21a35d9

ZhangX-21 had a problem deploying to Metax_ci June 9, 2026 03:15 — with GitHub Actions Failure

ZhangX-21 wants to merge 2 commits into PaddlePaddle:develop from ZhangX-21:piecewise_cudagraph
