Revert blockwise CUDAGraph and support piecewise CUDAGraph in prefill #7969

Open
ZhangX-21 wants to merge 2 commits into PaddlePaddle:develop from ZhangX-21:piecewise_cudagraph

@ZhangX-21

Motivation

This PR reverts the previous blockwise CUDAGraph implementation and adds support for piecewise CUDAGraph in the prefill phase.

Modifications

  • Revert blockwise CUDAGraph related logic.
  • Support piecewise CUDAGraph for prefill.
  • Capture reusable graph segments inside the prefill phase.
  • Refactor prefill CUDAGraph capture/replay control flow.
  • Keep decode CUDAGraph behavior unchanged.
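
For readers unfamiliar with the term, the idea behind piecewise capture can be sketched as follows. This is only an illustration under stated assumptions, not the code in this PR: the op names, the `EAGER_OPS` set, and the `split_into_pieces` helper are hypothetical, and which ops actually break the graph in FastDeploy is not stated here. The sketch splits a linear op sequence at ops that run eagerly, so each remaining run of ops becomes one reusable graph segment.

```python
from typing import List, Sequence, Set, Tuple

# Hypothetical op names: which ops actually split the graph in FastDeploy
# is an assumption made only for this illustration.
EAGER_OPS: Set[str] = {"attention"}


def split_into_pieces(
    ops: Sequence[str], eager_ops: Set[str] = EAGER_OPS
) -> List[Tuple[bool, List[str]]]:
    """Partition a linear op sequence into (capturable, ops) pieces.

    Each run of capturable ops becomes one piece (a candidate graph segment);
    every eager op becomes its own non-capturable piece.
    """
    pieces: List[Tuple[bool, List[str]]] = []
    current: List[str] = []
    for op in ops:
        if op in eager_ops:
            if current:
                # Close the capturable run that precedes the eager op.
                pieces.append((True, current))
                current = []
            pieces.append((False, [op]))
        else:
            current.append(op)
    if current:
        pieces.append((True, current))
    return pieces


layer_ops = ["rmsnorm", "qkv_proj", "attention", "o_proj", "rmsnorm", "mlp"]
pieces = split_into_pieces(layer_ops)
```

In this toy layer the two capturable pieces surround a single eager attention call, and decode is not involved at all.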

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@PaddlePaddle-bot

PaddlePaddle-bot commented Jun 2, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-08 13:05:54 UTC+08:00

CI report generated from the following code (refreshed every 30 minutes):
PR commit: 342bf2d | Merge base: 4474188 (branch: develop)


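1 Required jobs: 7/10 passed

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 42 (0) | 42 | 35 | 7 | 0 | 0 | 0 |

| Job | Error type | Confidence | Log |
|---|---|---|---|
| Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | PR issue | High | Job |
| Approval | Approval required | High | Job |
| Run Four Cards Tests / run_4_cards_tests | PR issue | Medium | Job |

2 Failure details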

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)

Failed test cases:

| Test case | Error summary |
|---|---|
| tests/model_executor/test_ep.py::test_eprunner_moe_select_noaux_tc_without_redundant | TypeError: scores + e_score_correction_bias fails because e_score_correction_bias is None |

Key log:

>       scores_with_bias = scores + e_score_correction_bias
E       TypeError: (InvalidType) __add__(): argument (position 1) must be int, float, bool or Tensor, but got NoneType

fastdeploy/model_executor/layers/moe/moe.py:118: TypeError
  • Root cause summary: the PR removed assert e_score_correction_bias is not None but did not handle the None case in the downstream logic.

At moe.py:106 the PR removed the assertion assert e_score_correction_bias is not None, so None can now reach the code that follows. When expert_id_to_ep_rank_array is None and not use_fused_cast holds, the code directly evaluates scores + e_score_correction_bias. The test case passes gate_correction_bias=None, which triggers the TypeError.

Suggested fixes:

  1. Near line 118 of fastdeploy/model_executor/layers/moe/moe.py, guard against e_score_correction_bias being None: scores_with_bias = scores + e_score_correction_bias if e_score_correction_bias is not None else scores
  2. Alternatively, ensure e_score_correction_bias is not None at the moe_select call site in ep.py.
  3. Update the test cases accordingly to verify correct behavior when gate_correction_bias=None.
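
The first suggestion can be illustrated with a small stand-alone sketch. It uses plain Python lists in place of Paddle tensors so it runs without a GPU, and the helper name `add_correction_bias` is hypothetical; the real code in moe.py operates on tensors and may need additional handling.

```python
from typing import List, Optional, Sequence


def add_correction_bias(
    scores: Sequence[float], bias: Optional[Sequence[float]]
) -> List[float]:
    """Return scores + bias element-wise, or a copy of scores when bias is None."""
    if bias is None:
        # No correction bias configured: fall back to the raw scores
        # instead of evaluating `scores + None`, which raises TypeError.
        return list(scores)
    if len(bias) != len(scores):
        raise ValueError("scores and bias must have the same length")
    return [s + b for s, b in zip(scores, bias)]


with_bias = add_correction_bias([0.5, 0.25], [1.0, 2.0])
without_bias = add_correction_bias([0.5, 0.25], None)
```

Whether falling back to the raw scores is the intended semantics for gate_correction_bias=None is a design decision for the PR author. The alternative is to keep the assertion and fix the caller.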

Related changes:

  • fastdeploy/model_executor/layers/moe/moe.py:106: removed assert e_score_correction_bias is not None (the direct cause of the failure)
  • fastdeploy/model_executor/layers/moe/ep.py: moved the get_moe_scores import to module level
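
🔴 Approval: approval required (confidence: high)

This job requires manual approval. CI continues only after the approval is granted.

🔴 Run Four Cards Tests / run_4_cards_tests: PR issue (confidence: medium)

Failed test cases:

| Test case | Error summary |
|---|---|
| test_GLM_45_AIR_mtp_tp4.py::test_r3_accuracy | EOFError: Ran out of input; paddle.io.load read an empty or truncated pickle file |

Key log: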

tests/e2e/4cards_cases/test_GLM_45_AIR_mtp_tp4.py:208: in test_r3_accuracy
tests/e2e/utils/rollout_routing_replay_test_utils.py:185: in check_routing_replay_chat_completion
paddle/framework/io.py:1275: in load
paddle/framework/restricted_unpickler.py:227: in safe_load_pickle
E   EOFError: Ran out of input
FAILED tests/e2e/4cards_cases/test_GLM_45_AIR_mtp_tp4.py::test_r3_accuracy
==================== 1 failed, 1 passed in 77.84s (0:01:17) ====================
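  • Root cause summary: the routing replay pickle file was written empty during the prefill CUDAGraph capture phase.

The PR decorates capture_model_prefill_and_mixed with the new @prefill_cudagraph_guard(True) and introduces a global in_prefill_cudagraph_mode guard. test_r3_accuracy verifies inference accuracy through the routing replay mechanism, which depends on serializing routing data into a pickle file. If the routing replay save logic conditionally skips the file write while in_prefill_cudagraph_mode is active, or if the piecewise CUDAGraph refactor of the prefill path leaves the routing data unpersisted, the pickle file ends up empty and EOFError: Ran out of input is raised.

Suggested fixes:

  1. Check whether rollout_routing_replay_test_utils.py and the routing replay manager branch on in_prefill_cudagraph_mode, and make sure routing data is still written correctly during the prefill CUDAGraph capture phase.
  2. Investigate whether piecewise CUDAGraph capture in capture_model_prefill_and_mixed conflicts with the routing replay save mechanism (for example, whether CPU-side file I/O is blocked during CUDAGraph capture).
  3. Reproduce locally: run the 4-card GLM-4.5-AIR-MTP-TP4 inference service, then check whether the generated routing replay pickle file is empty.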
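
For the local reproduction step, the empty-file symptom can be checked with a short stand-alone script. This is a minimal sketch that uses the standard-library pickle module rather than paddle.io.load, so it only mirrors the error class seen in the log. The helper name `load_replay_or_none`, the file names, and the sample payload are hypothetical.

```python
import os
import pickle
import tempfile
from typing import Any, Optional


def load_replay_or_none(path: str) -> Optional[Any]:
    """Load a pickle file, returning None if it is missing or zero bytes long."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    empty_path = os.path.join(tmp, "empty.pkl")
    open(empty_path, "wb").close()  # zero-byte file, as suspected in the report

    # Unpickling a zero-byte file directly raises EOFError.
    try:
        with open(empty_path, "rb") as f:
            pickle.load(f)
        raised_eof = False
    except EOFError:
        raised_eof = True

    empty_result = load_replay_or_none(empty_path)

    good_path = os.path.join(tmp, "good.pkl")
    with open(good_path, "wb") as f:
        pickle.dump({"layer_0": [1, 3]}, f)  # made-up routing payload
    good_result = load_replay_or_none(good_path)
```

A size check like this only makes the failure easier to diagnose. It does not address why the file was written empty in the first place.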

Related changes:

  • fastdeploy/worker/gpu_model_runner.py: added @prefill_cudagraph_guard(True) to capture_model_prefill_and_mixed
  • fastdeploy/model_executor/graph_optimization/utils.py: added prefill_cudagraph_guard and in_prefill_cudagraph_mode
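
To make the guard pattern concrete, here is one way a decorator-plus-global-flag pair of this kind could be written. The implementation in utils.py is not shown on this page, so the body below is an assumption that only reuses the two names from the report; the module-level `_IN_PREFILL_CUDAGRAPH` variable and the toy function body are hypothetical.

```python
import functools
from typing import Any, Callable

# Illustrative re-implementation, NOT the FastDeploy code.
_IN_PREFILL_CUDAGRAPH = False


def in_prefill_cudagraph_mode() -> bool:
    """Report whether execution is currently inside a guarded prefill capture."""
    return _IN_PREFILL_CUDAGRAPH


def prefill_cudagraph_guard(enabled: bool) -> Callable:
    """Decorator factory: set the global flag for the duration of the call."""

    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            global _IN_PREFILL_CUDAGRAPH
            previous = _IN_PREFILL_CUDAGRAPH
            _IN_PREFILL_CUDAGRAPH = enabled
            try:
                return fn(*args, **kwargs)
            finally:
                # Restore the old value even if the wrapped call raises.
                _IN_PREFILL_CUDAGRAPH = previous

        return wrapper

    return decorator


@prefill_cudagraph_guard(True)
def capture_model_prefill_and_mixed() -> bool:
    # Stand-in body: code reached from here can consult the flag.
    return in_prefill_cudagraph_mode()


flag_inside = capture_model_prefill_and_mixed()
flag_outside = in_prefill_cudagraph_mode()
```

With a guard shaped like this, any code that branches on in_prefill_cudagraph_mode() behaves differently only while the decorated function is on the call stack, which is the kind of interaction the suggested fixes ask to audit.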

PaddlePaddle-bot

This comment was marked as outdated.

@codecov-commenter

codecov-commenter commented Jun 2, 2026

Codecov Report

❌ Patch coverage is 62.50000% with 6 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@4474188). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/model_executor/models/glm4_moe.py | 0.00% | 3 Missing ⚠️ |
| fastdeploy/model_executor/layers/normalization.py | 71.42% | 1 Missing and 1 partial ⚠️ |
| ...tdeploy/model_executor/graph_optimization/utils.py | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7969   +/-   ##
==========================================
  Coverage           ?   67.45%           
==========================================
  Files              ?      466           
  Lines              ?    65196           
  Branches           ?    10015           
==========================================
  Hits               ?    43976           
  Misses             ?    18382           
  Partials           ?     2838           
| Flag | Coverage Δ |
|---|---|
| GPU | 77.68% <66.66%> (?) |
| XPU | 7.10% <0.00%> (?) |

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-02 16:26:10

📋 Review summary

PR overview: reverts the blockwise CUDAGraph implementation and adds piecewise CUDAGraph support in the prefill phase.
Change scope: model_executor/graph_optimization/, worker/, model_executor/layers/, config.py
Impact tags: [Graph Optimization] [Models] [FDConfig]

Issues

No blocking issues found.

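Status of previous findings

| Finding | Issue | Status |
|---|---|---|
| F1 | glm4_moe.py: both sides of the isinstance branch contain identical code | ⚠️ Still present |
| F2 | append_attn_backend.py: assertion code | ✅ Fixed (the assertion was deleted entirely rather than kept as a comment) |
| F3 | in_prefill_cudagraph_mode has no consumer | ⚠️ Still present |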

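📝 PR convention check

The title lacks an official tag. The description is structurally complete, but Usage or Command and Accuracy Tests contain only placeholder comments.

Suggested title (can be copied as-is):

  • [Graph Optimization] Revert blockwise CUDAGraph and support piecewise CUDAGraph in prefill

Suggested PR description (can be copied as-is):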
## Motivation

This PR reverts the previous blockwise CUDAGraph implementation and adds support for piecewise CUDAGraph in the prefill phase. The blockwise approach captured per-layer graphs which fragmented SOT-compiled graphs; the piecewise approach captures reusable sub-graph segments during prefill without graph fragmentation.

## Modifications

- Revert blockwise CUDAGraph related logic (remove `cuda_graph_op.py`, env vars `FD_USE_BLOCK_WISE_CUDA_GRAPH` / `FD_BLOCK_WISE_CUDA_GRAPH_SIZES`).
- Add `prefill_cudagraph_guard` to skip block-wise wrappers during prefill capture.
- Extend prefill capture sizes up to 8192 tokens in `config.py`.
- Refactor `RMSNorm.forward`: remove dtype cast, use dynamic `max_chunk_tokens` for allreduce fusion.
- Pass `max_token_num` to `flashinfer_allreduce_residual_rmsnorm` to match workspace allocation.
- Keep decode CUDAGraph behavior unchanged.

## Usage or Command

N/A

## Accuracy Tests

N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

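Overall assessment

The revert is clean and complete: the blockwise CUDAGraph code, environment variables, and tests have all been removed. The additions for piecewise CUDAGraph are minimal (only the new prefill_cudagraph_guard and the extended capture sizes). The has_residual refactor in normalization.py is SOT/CUDAGraph friendly and logically correct. It is recommended to fix the outstanding F1 (dead branch in glm4_moe.py) and F3 (in_prefill_cudagraph_mode has no consumer yet; either add a use or a comment explaining it).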
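
As background on the capture sizes mentioned in this review, the sketch below shows a common way a runtime maps an incoming prefill token count to a pre-captured graph size. It is an assumption for illustration only: the `CAPTURE_SIZES` list (powers of two up to 8192) and the `pick_capture_size` helper are hypothetical, and the actual list and lookup in config.py may differ.

```python
import bisect
from typing import List, Optional

# Hypothetical capture sizes: powers of two from 1 up to 8192 tokens.
# The real list in config.py may differ.
CAPTURE_SIZES: List[int] = [2 ** k for k in range(14)]


def pick_capture_size(
    num_tokens: int, sizes: List[int] = CAPTURE_SIZES
) -> Optional[int]:
    """Return the smallest captured size >= num_tokens, or None to run eagerly.

    `sizes` must be sorted in ascending order.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    idx = bisect.bisect_left(sizes, num_tokens)
    return sizes[idx] if idx < len(sizes) else None


small = pick_capture_size(1)
padded = pick_capture_size(3000)
largest = pick_capture_size(8192)
too_big = pick_capture_size(8193)
```

Under these assumptions a 3000-token prefill would be padded to the 4096-token graph, and anything above the largest captured size would fall back to eager execution.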
