[Cherry-Pick][XPU] Enable CudaGraph capture for MTP draft model (#7864) #7941
Clarity256 wants to merge 4 commits into
Thanks for your contribution!
root seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
Based on PR PaddlePaddle#7864, this adds draft model CudaGraph support:
- Enable step_use_cudagraph for draft model with proper gating logic
- Pass forward_meta and use_cudagraph to xpu_pre_process in draft path
- Add padding_cudagraph_inputs() for draft model buffer management
- Slice model output by real_token_num when graph is active
- Include target model cudagraph changes (xpu_model_runner, xpu_pre_and_post_process)
- Add build_sampling_params XPU kernel for MTP

Verified stable with benchmark_serving (100 requests, concurrency=16, 0 failures).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
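The padding and slicing steps listed in the commit message can be illustrated with a minimal pure-Python sketch. It is not the code from this PR: `pad_to_capture_size`, `run_graph`, `capture_sizes`, and `pad_id` are hypothetical names, and plain lists stand in for device tensors. The idea shown is only that inputs are padded up to a captured size and the output is cut back to `real_token_num` afterwards.

```python
def pad_to_capture_size(token_ids, capture_sizes, pad_id=0):
    """Pad a flat token list up to the smallest captured size that fits it.

    All names here are illustrative assumptions, not FastDeploy APIs.
    """
    real_token_num = len(token_ids)
    # Pick the smallest captured graph size that can hold the real tokens.
    candidates = [s for s in sorted(capture_sizes) if s >= real_token_num]
    if not candidates:
        raise ValueError("no captured size is large enough for this input")
    padded_len = candidates[0]
    padded = token_ids + [pad_id] * (padded_len - real_token_num)
    return padded, real_token_num


def run_graph(padded_tokens):
    """Stand-in for graph replay: one output row per (padded) input token."""
    return [[float(t), float(t) * 2.0] for t in padded_tokens]


padded, real_token_num = pad_to_capture_size([11, 12, 13], capture_sizes=[1, 2, 4, 8])
hidden = run_graph(padded)        # row count follows the captured size
hidden = hidden[:real_token_num]  # keep only the rows that belong to real tokens
```

Slicing after replay keeps the padded rows from reaching later stages such as sampling.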
5a6882b to bf150a4
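CI report, generated from the following code (refreshed every 30 minutes):
1. Required jobs: 8/10 passed
2. Failure details
🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: environment issue (confidence: high)
Failed cases:
Key logs:
Suggested fix:
Related changes: the files changed in this PR are
🔴 Approval: approval required (confidence: high)
This job requires manual approval; CI resumes only after the approval is granted.
Suggested fix: obtain the manual approval.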
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #7941 +/- ##
==========================================
Coverage ? 66.98%
==========================================
Files ? 467
Lines ? 65630
Branches ? 10113
==========================================
Hits ? 43962
Misses ? 18885
Partials ? 2783
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4ef7a1f to 00299e4
When TP>1 with MTP+CudaGraph, the proposer's dummy_prefill_inputs was hardcoded with expected_decode_len=1, causing shape mismatch during CudaGraph capture. Use the actual expected_decode_len variable instead.
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-06-03 11:03:46
📋 Review summary
PR overview: Adds the missing CUDAGraph capture support for the MTP speculative-decoding draft model on the XPU platform, introduces a build_sampling_params XPU operator to replace the Python-side padding_sampling_params, and adapts the warmup/capture flow.
Change scope: custom_ops/xpu_ops/ (new operator kernel), fastdeploy/spec_decode/mtp_xpu.py, fastdeploy/worker/xpu_model_runner.py, fastdeploy/model_executor/layers/sample/sampler.py, fastdeploy/model_executor/xpu_pre_and_post_process.py
Impact tags: [XPU] [Speculative Decoding] [OP] [Graph Optimization]
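Issues
| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/spec_decode/mtp_xpu.py:113 | The substep parameter is not declared in the _initialize_forward_meta signature, so the call raises a TypeError |
| 🟡 Suggestion | fastdeploy/worker/xpu_model_runner.py:1210 | Integer-division truncation of capture_size (previous findings F4/F8 still open) |
| 🟡 Suggestion | custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu:75 | O(bs²) cost of the pad_start computation (previous finding F5 still open) |
| 🟡 Suggestion | fastdeploy/spec_decode/mtp_xpu.py:329 | Large amount of leftover commented-out code, which hurts readability (previous finding F7 still open) |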
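Two of the findings above can be reproduced in isolation. The sketch below is illustrative only: `_initialize_forward_meta` is a hypothetical stand-in with a made-up signature, and `batch_size_floor` and `tokens_per_request` are assumed names rather than code from xpu_model_runner.py. It shows that passing a keyword the signature does not declare raises a TypeError, and that floor division silently drops the remainder.

```python
def _initialize_forward_meta(step_use_cudagraph=False):
    """Hypothetical stand-in whose signature does not declare `substep`."""
    return {"step_use_cudagraph": step_use_cudagraph}


def call_with_substep():
    """Call the stand-in with an undeclared keyword and report the error text."""
    try:
        _initialize_forward_meta(step_use_cudagraph=True, substep=0)
    except TypeError as exc:
        return str(exc)
    return None


error_message = call_with_substep()  # a TypeError message that mentions 'substep'


def batch_size_floor(capture_size, tokens_per_request):
    """Floor division: any remainder of capture_size is discarded."""
    return capture_size // tokens_per_request


capture_size = 10
tokens_per_request = 4  # assumed example value, e.g. num_speculative_tokens + 1
batch_size = batch_size_floor(capture_size, tokens_per_request)
covered_tokens = batch_size * tokens_per_request
dropped_tokens = capture_size - covered_tokens  # tokens the truncated batch does not cover
```

Whether the remainder should be rounded up or the capture sizes restricted to exact multiples is a design decision for the PR author; the sketch only makes the truncation visible.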
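Status of previous findings
| Finding | Issue | Status |
|---|---|---|
| F1 | The _initialize_forward_meta method signature lacks the substep parameter | |
| F2 | The infer_seed update relies on the increment inside build_sampling_params, while the non-speculative path still uses the external update | ✅ Fixed (in the diff, the if not self.speculative_decoding branch works together with the update inside the build_sampling_params operator, so the logic is closed) |
| F3 | The _initialize_forward_meta signature lacks substep | |
| F4 | Integer-division truncation of capture_size makes batch_size too small | |
| F5 | O(bs²) cost of the pad_start computation | |
| F6 | Inconsistent semantics of sampling_metadata.topp_seed | ✅ Fixed (_normal_sample_xpu uses sampling_metadata.topp_seed directly, matching the output of the build_sampling_params operator) |
| F7 | Large amount of leftover commented-out code | |
| F8 | Integer-division truncation of capture_size (same as F4) | |
| F9 | The runtime update of proposer.fd_config.parallel_config.moe_phase.phase was removed | 🔄 Partially fixed (the original logic was split: the use_ep check moved into the unified if self.fd_config.parallel_config.use_ep branch, but the proposer-side moe_phase.phase update no longer exists; the behavior when EP and speculative decoding are both enabled needs to be confirmed) |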
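For F2, the in-place seed update can be pictured as follows. This is a sketch under stated assumptions, not the operator's actual code: `advance_seeds_in_place` and `max_seed` are hypothetical, the modulus 100 is chosen only to make the wrap-around visible, and a Python list stands in for the `infer_seed` tensor. The `increment_value` formula is the one quoted in the PR description.

```python
def advance_seeds_in_place(infer_seed, increment_value, max_seed):
    """Advance every seed in place and wrap around at max_seed.

    Mutating the existing list (instead of returning a new one) mirrors the
    idea of updating the seed buffer inside the operator.
    """
    for i in range(len(infer_seed)):
        infer_seed[i] = (infer_seed[i] + increment_value) % max_seed


num_speculative_tokens = 1
increment_value = (num_speculative_tokens + 1) * 4  # formula from the PR description

seeds = [0, 5, 98]
buffer_id = id(seeds)  # identity of the buffer before the update
advance_seeds_in_place(seeds, increment_value, max_seed=100)  # max_seed is an assumed value
same_buffer = id(seeds) == buffer_id
```

Because the update mutates the existing buffer, no extra seed bookkeeping is needed outside the graph in this model of the speculative path.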
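📝 PR convention check
The title follows the Cherry-Pick convention [Cherry-Pick][XPU] ... ✅. However, the ## Usage or Command section is empty (template comment only), and no Checklist item is ticked (although the unit test test_build_sampling_params.py has been added and accuracy comparison screenshots are provided). Suggested completion:
Suggested title (can be copied directly):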
[Cherry-Pick][XPU][Speculative Decoding] Enable CUDAGraph capture for MTP draft model (#7864)
Suggested PR description (can be copied directly)
## Motivation
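In PR #7864, the target model of MTP speculative decoding on the XPU platform gained CUDAGraph support, but CUDAGraph capture is still not enabled on the draft model side. Building on #7864, this PR adds CUDAGraph support for the MTP draft model, mainly:
1. Enable the `step_use_cudagraph` gating logic for the draft model forward pass, and capture only the first step in multi-step execution.
2. Pass `forward_meta` and `use_cudagraph` to `xpu_pre_process` in the draft model inference path, so that `cu_seqlens_q_output` / `batch_id_per_token_output` are updated in place with `copy_` in cudagraph mode, which keeps the tensor addresses stable.
3. Add a `padding_cudagraph_inputs()` method to handle buffer padding for the draft model, and slice the model output by `real_token_num` during graph replay.
4. Adapt the speculative-decoding warmup flow on the target model side (capture size computation, passing the `accept_all_drafts` parameter).
5. Replace `padding_sampling_params` (Python-side CPU implementation) with the `build_sampling_params` XPU custom operator, which updates `infer_seed` in place inside the operator and avoids extra operations outside the cudagraph.
6. Make `increment_value` track the number of speculative tokens (`(num_speculative_tokens + 1) * 4`).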
## Modifications
- **`custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc`**: Adds the Paddle custom operator entry point and registers the `build_sampling_params` op.
- **`custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h`**: Declares the `build_sampling_params` C interface.
- **`custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu`**: Adds the XPU3 kernel implementation.
- **`custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp`**: Adds the CPU wrapper and the XPU3 wrapper.
- **`custom_ops/xpu_ops/test/test_build_sampling_params.py`**: Adds unit tests covering pure-decoder, pure-encoder, mixed, single-request, and seed wrap-around cases.
- **`fastdeploy/model_executor/layers/sample/sampler.py`**: `forward_xpu` now uses the `build_sampling_params` XPU operator instead of `padding_sampling_params`; adds an `increment_value` parameter.
- **`fastdeploy/model_executor/xpu_pre_and_post_process.py`**: In cudagraph mode, updates `cu_seqlens_q_output` and `batch_id_per_token_output` in place with `copy_`, so the tensor addresses captured by the graph stay stable.
- **`fastdeploy/spec_decode/mtp_xpu.py`**: Enables `step_use_cudagraph` gating for the draft model; `_propose` gains cudagraph padding logic and output slicing; `_initialize_forward_meta` passes the cudagraph parameters.
- **`fastdeploy/worker/xpu_model_runner.py`**: `increment_value` tracks the number of speculative tokens; the warmup capture flow is adapted for speculative decoding; the `infer_seed` update moves into the `build_sampling_params` operator; the draft model propose call passes `step_use_cudagraph`.
- **`tests/xpu_ci/4cards_cases/run_mtp_cudagraph.py → test_mtp_cudagraph.py`**: Renames the test script to follow the CI naming convention.
## Usage or Command
N/A
## Accuracy Tests
- MTP with CUDAGraph: output matches the reference result (see the PR screenshots)
- MTP without CUDAGraph: output matches the reference result (see the PR screenshots)
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
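- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Overall assessment
The overall approach is clear, the CUDAGraph capture logic (tensor address stability, first-step capture, output slicing) is well designed, and the unit tests cover the relevant scenarios thoroughly. Four previously suggested items (F1/F3 substep missing from the signature, F4/F8 integer-division truncation, F5 O(bs²) cost, F7 commented-out code) remain unfixed; resolving them before this merge is recommended.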
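The tensor-address-stability point can be illustrated without any device code. In the sketch below, `StableBuffer` is a hypothetical class and a Python list stands in for a preallocated tensor; its `copy_` method only imitates the spirit of the in-place `copy_` update named in the PR description. A captured graph keeps referring to the storage it saw at capture time, so new values have to be written into that storage rather than into a fresh object.

```python
class StableBuffer:
    """Preallocated storage whose identity never changes after creation."""

    def __init__(self, capacity, fill=0):
        self.data = [fill] * capacity

    def copy_(self, values):
        """Overwrite the leading slots in place; the list object is reused."""
        if len(values) > len(self.data):
            raise ValueError("values exceed the preallocated capacity")
        self.data[:len(values)] = values
        return self


cu_seqlens_q_output = StableBuffer(capacity=5)
captured_storage = cu_seqlens_q_output.data  # what a captured graph would keep pointing at

cu_seqlens_q_output.copy_([0, 2, 4])
in_place_ok = captured_storage is cu_seqlens_q_output.data  # same storage, new contents

# Rebinding to a freshly built list creates different storage, which a
# previously captured graph would never read.
rebound = [0, 2, 4]
rebinding_keeps_storage = captured_storage is rebound
```

The same reasoning explains why padded buffers are allocated once at a fixed capacity and reused on every replay.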
GM2LM(seq_lens_encoder + bi, &lm_sle, sizeof(int));  // sync barrier

bool is_decoder = (lm_sle == 0);
int repeat = is_decoder ? lm_slt : 1;
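🟡 Suggestion: the pad_start computation has an O(bs²) performance problem (previous finding F5 still open)
In the current implementation, core 0 of every cluster scans the preceding bi batch items sequentially to compute pad_start, so the overall complexity is O(bs²). With a large bs this adds noticeable latency.
Suggested fix: switch to a prefix sum and compute all pad_start[] values once before entering the batch loop, which is O(bs). Following the CPU wrapper's approach, core 0 can precompute a pad_offsets[] array, and each cluster then reads its own offset directly.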
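A host-side Python sketch of the suggested prefix sum is given below. It is not the XPU kernel: the function names are hypothetical, and the only part taken from the quoted code is the `repeat` rule (a decoder item, where `seq_lens_encoder` is 0, expands to `seq_lens_this_time` entries, any other item to 1). The two functions return the same offsets; the second one visits each batch item once.

```python
from itertools import accumulate


def repeats(seq_lens_encoder, seq_lens_this_time):
    """Per-item expansion count, mirroring the quoted `repeat` rule."""
    return [slt if sle == 0 else 1
            for sle, slt in zip(seq_lens_encoder, seq_lens_this_time)]


def pad_start_quadratic(reps):
    """Rescan all earlier items for every item: O(bs^2) overall."""
    return [sum(reps[:bi]) for bi in range(len(reps))]


def pad_start_prefix_sum(reps):
    """Exclusive prefix sum computed in one pass: O(bs) overall."""
    return list(accumulate(reps, initial=0))[:-1]


reps = repeats(seq_lens_encoder=[0, 7, 0], seq_lens_this_time=[2, 7, 3])
slow = pad_start_quadratic(reps)
fast = pad_start_prefix_sum(reps)
```

On device, the single pass could run once on core 0 before the batch loop, as the comment suggests, with each cluster reading its own entry afterwards.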
@@ -298,3 +329,16 @@ def _update_status(self):
self.target_model_inputs["seq_lens_encoder"],
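🟡 Suggestion: a large amount of commented-out code remains and should be cleaned up (previous finding F7 still open)
The _propose method contains a large number of commented-out blocks (about 15 lines), including the is_blocking and token_num_cpu logic, which hurts readability.
Suggestion: delete this code once it is confirmed to be unnecessary; if it is still under discussion, note the corresponding issue number or the expected fix date in the comment.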
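Motivation
In PR #7864, the target model of MTP speculative decoding on the XPU platform gained CUDAGraph support, but CUDAGraph capture is still not enabled on the draft model side. Building on #7864, this PR adds CUDAGraph support for the MTP draft model, mainly:
1. Enable the `step_use_cudagraph` gating logic for the draft model forward pass, and capture only the first step in multi-step execution.
2. Pass `forward_meta` and `use_cudagraph` to `xpu_pre_process` in the draft model inference path, so that `cu_seqlens_q_output` / `batch_id_per_token_output` are updated in place with `copy_` in cudagraph mode, which keeps the tensor addresses stable.
3. Add a `padding_cudagraph_inputs()` method to handle buffer padding for the draft model, and slice the model output by `real_token_num` during graph replay.
4. Adapt the speculative-decoding warmup flow on the target model side (capture size computation, passing the `accept_all_drafts` parameter).
5. Replace `padding_sampling_params` (Python-side CPU implementation) with the `build_sampling_params` XPU custom operator, which updates `infer_seed` in place inside the operator and avoids extra operations outside the cudagraph.
6. Make `increment_value` track the number of speculative tokens (`(num_speculative_tokens + 1) * 4`).

Modifications
- `custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc`: Adds the Paddle custom operator entry point and registers the `build_sampling_params` op.
- `custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h`: Declares the `build_sampling_params` C interface.
- `custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu`: Adds the XPU3 kernel implementation.
- `custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp`: Adds the CPU wrapper and the XPU3 wrapper.
- `custom_ops/xpu_ops/test/test_build_sampling_params.py`: Adds unit tests covering pure-decoder, pure-encoder, mixed, single-request, and seed wrap-around cases.
- `fastdeploy/model_executor/layers/sample/sampler.py`: `forward_xpu` now uses the `build_sampling_params` XPU operator instead of `padding_sampling_params`; adds an `increment_value` parameter.
- `fastdeploy/model_executor/xpu_pre_and_post_process.py`: In cudagraph mode, updates `cu_seqlens_q_output` and `batch_id_per_token_output` in place with `copy_`, so the tensor addresses captured by the graph stay stable.
- `fastdeploy/spec_decode/mtp_xpu.py`: Enables `step_use_cudagraph` gating for the draft model; `_propose` gains cudagraph padding logic and output slicing; `_initialize_forward_meta` passes the cudagraph parameters.
- `fastdeploy/worker/xpu_model_runner.py`: `increment_value` tracks the number of speculative tokens; the warmup capture flow is adapted for speculative decoding; the `infer_seed` update moves into the `build_sampling_params` operator; the draft model propose call passes `step_use_cudagraph`.
- `tests/xpu_ci/4cards_cases/run_mtp_cudagraph.py → test_mtp_cudagraph.py`: Renames the test script to follow the CI naming convention.

Usage or Command

Accuracy Tests

Checklist
- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.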