
[Cherry-Pick][XPU] Enable CudaGraph capture for MTP draft model(#7864)#7941

Open
Clarity256 wants to merge 4 commits into PaddlePaddle:develop from Clarity256:feature/mtp-cudagraph-draft-model-xpu

Conversation


Clarity256 commented May 27, 2026


Motivation

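PR #7864 added CUDAGraph support for the target model in MTP speculative decoding on XPU, but CUDAGraph capture was still not enabled on the draft model side. Building on #7864, this PR completes CUDAGraph support for the MTP draft model. The main changes are:

  1. Enable the step_use_cudagraph gating logic for draft model forward inference, and capture only the first step in multi-step execution.
  2. Pass forward_meta and use_cudagraph to xpu_pre_process in the draft model inference path, so that cu_seqlens_q_output / batch_id_per_token_output are updated in place with copy_ in cudagraph mode, which keeps tensor addresses stable.
  3. Add a padding_cudagraph_inputs() method to handle buffer padding for the draft model, and slice the model output by real_token_num during graph replay.
  4. Adapt the target model's speculative decoding warmup flow (capture size calculation, passing the accept_all_drafts parameter).
  5. Replace padding_sampling_params (a Python-side CPU implementation) with the build_sampling_params XPU custom operator, which updates infer_seed in place inside the operator and avoids extra operations outside the cudagraph.
  6. Tie increment_value to the number of speculative tokens ((num_speculative_tokens + 1) * 4).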
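The padding and slicing described in item 3 can be illustrated with a small, framework-free sketch. It is not the PR's padding_cudagraph_inputs() implementation: the capture size list, the helper names pad_to_capture_size and replay_graph, and the use of plain lists instead of device tensors are all assumptions made for illustration.

```python
from bisect import bisect_left

# Hypothetical capture sizes; the real list comes from the engine configuration.
CAPTURE_SIZES = [1, 2, 4, 8, 16]


def pad_to_capture_size(tokens, capture_sizes=CAPTURE_SIZES, pad_value=0):
    """Pad `tokens` up to the smallest captured size that can hold them."""
    real_token_num = len(tokens)
    idx = bisect_left(capture_sizes, real_token_num)
    if idx == len(capture_sizes):
        # Nothing captured is large enough; a real runner would fall back to eager mode.
        raise ValueError("batch exceeds the largest captured size")
    padded_len = capture_sizes[idx]
    padded = tokens + [pad_value] * (padded_len - real_token_num)
    return padded, real_token_num


def replay_graph(padded_tokens):
    """Stand-in for graph replay: produces one output per padded slot."""
    return [t * 2 for t in padded_tokens]


padded, real_token_num = pad_to_capture_size([3, 5, 7])
# Keep only the outputs that belong to real tokens, dropping the padding slots.
output = replay_graph(padded)[:real_token_num]
```

A replayed graph always runs at the captured shape, so the caller has to remember real_token_num and discard the trailing outputs itself.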

Modifications

  • custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc: adds the Paddle custom operator entry point and registers the build_sampling_params op.
  • custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h: declares the build_sampling_params C interface.
  • custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu: adds the XPU3 kernel implementation.
  • custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp: adds the CPU wrapper and the XPU3 wrapper.
  • custom_ops/xpu_ops/test/test_build_sampling_params.py: adds unit tests covering pure decoder, pure encoder, mixed, single-request, and seed wrap-around scenarios.
  • fastdeploy/model_executor/layers/sample/sampler.py: forward_xpu now uses the build_sampling_params XPU operator instead of padding_sampling_params; adds the increment_value parameter.
  • fastdeploy/model_executor/xpu_pre_and_post_process.py: in cudagraph mode, cu_seqlens_q_output and batch_id_per_token_output are updated in place with copy_ so that the tensor addresses captured by the graph stay stable.
  • fastdeploy/spec_decode/mtp_xpu.py: enables step_use_cudagraph gating for the draft model; _propose gains cudagraph padding logic and output slicing; _initialize_forward_meta passes the cudagraph parameters.
  • fastdeploy/worker/xpu_model_runner.py: ties increment_value to the number of speculative tokens; adapts the warmup capture flow to speculative decoding; moves the infer_seed update into the build_sampling_params operator; passes step_use_cudagraph to the draft model propose call.
  • tests/xpu_ci/4cards_cases/run_mtp_cudagraph.py → test_mtp_cudagraph.py: renames the test script to match the CI naming convention.
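The reason for using copy_ instead of reassignment can be shown with a toy analogy. The sketch below uses a plain Python list as a stand-in for device memory and a hypothetical StaticBuffer class; it is not Paddle code, only an illustration of why a captured graph needs the same storage to be reused.

```python
class StaticBuffer:
    """Toy stand-in for a tensor whose storage a captured graph points at."""

    def __init__(self, size):
        self.data = [0] * size

    def copy_(self, values):
        # In-place update: the same underlying list object receives new contents.
        if len(values) != len(self.data):
            raise ValueError("shape mismatch")
        self.data[:] = values
        return self


buf = StaticBuffer(4)
captured = buf.data  # what a captured graph would "remember"

buf.copy_([0, 2, 5, 9])
# The captured reference sees the new values because the storage was reused.
in_place_ok = captured is buf.data and captured == [0, 2, 5, 9]

buf.data = [0, 1, 2, 3]  # rebinding creates a new object
# The captured reference now points at stale storage.
rebinding_breaks = captured is not buf.data
```

With rebinding, a replay would keep reading the old storage, which is the failure mode the in-place update avoids.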

Usage or Command

Accuracy Tests

  • MTP with CUDAGraph: output matches the reference result (see the PR screenshot mtp_with_cudagraph).
  • MTP without CUDAGraph: output matches the reference result (see the PR screenshot mtp_without_cudagraph).

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.


paddle-bot Bot commented May 27, 2026


Thanks for your contribution!

@paddle-bot paddle-bot Bot added the contributor External developers label May 27, 2026

CLAassistant commented May 27, 2026


CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ Clarity256
❌ root


root seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Based on PR PaddlePaddle#7864, this adds draft model CudaGraph support:
- Enable step_use_cudagraph for draft model with proper gating logic
- Pass forward_meta and use_cudagraph to xpu_pre_process in draft path
- Add padding_cudagraph_inputs() for draft model buffer management
- Slice model output by real_token_num when graph is active
- Include target model cudagraph changes (xpu_model_runner, xpu_pre_and_post_process)
- Add build_sampling_params XPU kernel for MTP

Verified stable with benchmark_serving (100 requests, concurrency=16, 0 failures).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clarity256 force-pushed the feature/mtp-cudagraph-draft-model-xpu branch from 5a6882b to bf150a4 on May 27, 2026 09:19
PaddlePaddle-bot

This comment was marked as outdated.



PaddlePaddle-bot commented May 27, 2026


🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-06 14:09:08

This CI report was generated from the following code (updated every 30 minutes):
PR commit: 60c6965 | Merge base: d0a9661 (branch: develop)


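1 Required jobs: 8/10 passed

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 42(0) | 42 | 36 | 6 | 0 | 0 | 0 |

| Job | Error type | Confidence | Log |
| --- | --- | --- | --- |
| Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | Environment issue | High | Job |
| Approval | Requires approval | High | Job |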

2 Failure details

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: environment issue (confidence: high)

Failed cases:

| Case | Error summary |
| --- | --- |
| operators/test_flash_mask_attn.py::TestFlashMaskAttention::test_fa3_with_mask, layers/test_attention_layer.py::TestAttentionPerformance::test_flash_attn_v3_with_mask, layers/test_flash_attn_func.py::TestFlashAttnFunc::test_fa3_with_mask | Paddle fails to load libflashmaskv2.so dynamically because nvshmem_bootstrap_uid.so.3 is missing |
| rl/test_dynamic_weight_manager.py::TestRecreateDeepepBuffer::test_not_first_load_recreates, rl/test_dynamic_weight_manager.py::TestUpdateParametersWithEP::test_not_first_load_with_ep, rl/test_dynamic_weight_manager.py and others (5 in total) | libpaddle.so in the installed Paddle package does not export Buffer, so the DeepEP import fails |
| distributed/test_fusedmoe_ep_entry, layers/test_attention_layer | The config log shows engine_worker_queue_port is None, and subscripting it raises TypeError |

Key logs:

RuntimeError: (PreconditionNotMet) The third-party dynamic library (libflashmaskv2.so) ...
error code is nvshmem_bootstrap_uid.so.3: cannot open shared object file: No such file or directory
ImportError: cannot import name 'Buffer' from 'paddle.base.libpaddle'
fastdeploy/config.py:2393 engine_worker_queue_port[local_data_parallel_id]
TypeError: 'NoneType' object is not subscriptable
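  • Root cause summary: the CI dependency libraries / runtime environment are incomplete.
    The failures are concentrated in the generic GPU/DeepEP unit tests. The FA3/flashmask cases are missing nvshmem_bootstrap_uid.so.3 when paddle/nn/functional/flash_attention.py calls _C_ops.flashmask_attention_v2. The RL dynamic weight cases fail when fastdeploy/model_executor/layers/moe/ep.py:60-72 imports Paddle DeepEP, because paddle.base.libpaddle does not contain the Buffer symbol. The config log also shows fastdeploy/config.py:2336-2338 subscripting an empty engine_worker_queue_port. None of these failure points lie in the XPU/MTP CUDAGraph code paths modified by this PR.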

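Suggested fixes:

  1. This is an environment issue; please rerun. If the rerun still fails, check whether the CI image/runner has the NVSHMEM runtime libraries installed correctly, and confirm that LD_LIBRARY_PATH can find nvshmem_bootstrap_uid.so.3.
  2. Check that the Paddle package and the DeepEP version used by CI match. Make sure paddle.base.libpaddle exports Buffer, or switch to a Paddle build that includes the DeepEP Buffer.
  3. config_error.log keeps appearing; check whether the unit test launch arguments / environment variables pass a valid engine_worker_queue_port list for the DP configuration.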

Related changes: the files changed by this PR are custom_ops/xpu_ops/**, fastdeploy/spec_decode/mtp_xpu.py, fastdeploy/worker/xpu_model_runner.py, fastdeploy/model_executor/xpu_pre_and_post_process.py, fastdeploy/model_executor/layers/sample/sampler.py, and tests/xpu_ci/4cards_cases/test_mtp_cudagraph.py. It does not modify the failing cases above, fastdeploy/model_executor/layers/moe/ep.py, or fastdeploy/config.py.

🔴 Approval: requires approval (confidence: high)

This job requires manual approval; CI continues only after the approval is completed.

Suggested fix: obtain manual approval.


codecov-commenter commented May 27, 2026


Codecov Report

❌ Patch coverage is 2.50000% with 39 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@d0a9661). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/worker/xpu_model_runner.py | 0.00% | 18 Missing ⚠️ |
| fastdeploy/spec_decode/mtp_xpu.py | 0.00% | 12 Missing ⚠️ |
| ...tdeploy/model_executor/xpu_pre_and_post_process.py | 0.00% | 5 Missing ⚠️ |
| fastdeploy/model_executor/layers/sample/sampler.py | 20.00% | 4 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7941   +/-   ##
==========================================
  Coverage           ?   66.98%           
==========================================
  Files              ?      467           
  Lines              ?    65630           
  Branches           ?    10113           
==========================================
  Hits               ?    43962           
  Misses             ?    18885           
  Partials           ?     2783           
| Flag | Coverage Δ |
| --- | --- |
| GPU | 77.08% <20.00%> (?) |
| XPU | 7.06% <0.00%> (?) |



Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clarity256 force-pushed the feature/mtp-cudagraph-draft-model-xpu branch from 4ef7a1f to 00299e4 on May 28, 2026 06:39

When TP>1 with MTP+CudaGraph, the proposer's dummy_prefill_inputs was
hardcoded with expected_decode_len=1, causing shape mismatch during
CudaGraph capture. Use the actual expected_decode_len variable instead.

@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-06-03 11:03:46

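📋 Review summary

PR overview: completes CUDAGraph capture support for the MTP speculative decoding draft model on XPU, adds the build_sampling_params XPU operator to replace the Python-side padding_sampling_params, and adapts the warmup/capture flow.

Scope of changes: custom_ops/xpu_ops/ (new operator kernel), fastdeploy/spec_decode/mtp_xpu.py, fastdeploy/worker/xpu_model_runner.py, fastdeploy/model_executor/layers/sample/sampler.py, fastdeploy/model_executor/xpu_pre_and_post_process.py

Impact tags: [XPU] [Speculative Decoding] [OP] [Graph Optimization]

Issues

| Level | File | Summary |
| --- | --- | --- |
| 🟡 Suggestion | fastdeploy/spec_decode/mtp_xpu.py:113 | The substep parameter is not declared in the _initialize_forward_meta signature, so the call raises TypeError |
| 🟡 Suggestion | fastdeploy/worker/xpu_model_runner.py:1210 | capture_size integer-division truncation (historical F4/F8 still present) |
| 🟡 Suggestion | custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu:75 | O(bs²) performance issue in the pad_start computation (historical F5 still present) |
| 🟡 Suggestion | fastdeploy/spec_decode/mtp_xpu.py:329 | A large amount of leftover commented-out code reduces readability (historical F7 still present) |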
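The first issue in the table can be reproduced in isolation. The sketch below uses two hypothetical classes (Proposer and FixedProposer) rather than the real MTP proposer; only the method name _initialize_forward_meta and the parameter names come from the review. It shows that passing an undeclared keyword argument raises TypeError, and that declaring it with a default keeps existing call sites working.

```python
class Proposer:
    # Hypothetical class: the signature lacks `substep`, as the finding describes.
    def _initialize_forward_meta(self, step_use_cudagraph=False):
        return {"step_use_cudagraph": step_use_cudagraph}


class FixedProposer:
    # Declaring `substep` with a default value accepts both old and new callers.
    def _initialize_forward_meta(self, step_use_cudagraph=False, substep=0):
        return {"step_use_cudagraph": step_use_cudagraph, "substep": substep}


try:
    Proposer()._initialize_forward_meta(step_use_cudagraph=True, substep=0)
    error = None
except TypeError as exc:
    # Python rejects keyword arguments that the signature does not declare.
    error = exc

meta = FixedProposer()._initialize_forward_meta(step_use_cudagraph=True, substep=0)
legacy_meta = FixedProposer()._initialize_forward_meta(step_use_cudagraph=False)
```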

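Status of historical findings

| Finding | Issue | Status |
| --- | --- | --- |
| F1 | The _initialize_forward_meta method signature lacks the substep parameter | ⚠️ Still present |
| F2 | The infer_seed update path relies on the increment inside build_sampling_params, while the non-speculative path still uses the external update | ✅ Fixed (in the diff, the if not self.speculative_decoding branch works together with the update inside the build_sampling_params operator, so the logic is closed) |
| F3 | The _initialize_forward_meta signature lacks substep | ⚠️ Still present |
| F4 | capture_size integer-division truncation makes batch_size too small | ⚠️ Still present |
| F5 | O(bs²) performance issue in the pad_start computation | ⚠️ Still present |
| F6 | Inconsistent semantics of sampling_metadata.topp_seed | ✅ Fixed (_normal_sample_xpu uses sampling_metadata.topp_seed directly, aligned with the output of the build_sampling_params operator) |
| F7 | A large amount of leftover commented-out code | ⚠️ Still present |
| F8 | capture_size integer-division truncation (same as F4) | ⚠️ Still present |
| F9 | The runtime update of proposer.fd_config.parallel_config.moe_phase.phase was removed | 🔄 Partially fixed (the original logic has been split: the use_ep check moved into the unified if self.fd_config.parallel_config.use_ep branch, but the proposer-side moe_phase.phase update no longer exists; the behavior when EP and speculative decoding are both enabled needs confirmation) |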
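The truncation flagged in F4/F8 can be illustrated with plain arithmetic. The actual expression at xpu_model_runner.py:1210 is not shown in this thread, so the sketch below assumes the batch size is obtained by dividing the capture size by the tokens per request (num_speculative_tokens + 1); the function names and the example values are hypothetical.

```python
def batch_size_floor(capture_size, tokens_per_request):
    """Floor division: silently drops any remainder."""
    return capture_size // tokens_per_request


def batch_size_ceil(capture_size, tokens_per_request):
    """Ceiling division using integers only."""
    return -(-capture_size // tokens_per_request)


num_speculative_tokens = 2  # hypothetical value
tokens_per_request = num_speculative_tokens + 1
capture_size = 8

floor_bs = batch_size_floor(capture_size, tokens_per_request)
ceil_bs = batch_size_ceil(capture_size, tokens_per_request)

# Tokens actually covered by each choice of batch size.
floor_tokens = floor_bs * tokens_per_request
ceil_tokens = ceil_bs * tokens_per_request
```

With these values the floor variant covers 6 tokens, fewer than the requested capture size of 8, while the ceiling variant covers 9. Whether rounding up or restricting capture sizes to multiples of tokens_per_request is the right fix depends on the real code.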

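📝 PR convention check

The title format follows the Cherry-Pick convention [Cherry-Pick][XPU] ... ✅. However, the ## Usage or Command section is empty (template comment only), and none of the Checklist items are ticked, even though the unit test test_build_sampling_params.py has been added and accuracy comparison screenshots are provided. Suggested completion:

Suggested title (can be copied directly):

  • [Cherry-Pick][XPU][Speculative Decoding] Enable CUDAGraph capture for MTP draft model (#7864)
Suggested PR description (click to expand; can be copied directly)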
## Motivation
PR #7864 added CUDAGraph support for the target model in MTP speculative decoding on XPU, but CUDAGraph capture was still not enabled on the draft model side. Building on #7864, this PR completes CUDAGraph support for the MTP draft model. The main changes are:

1. Enable the `step_use_cudagraph` gating logic for draft model forward inference, and capture only the first step in multi-step execution.
2. Pass `forward_meta` and `use_cudagraph` to `xpu_pre_process` in the draft model inference path, so that `cu_seqlens_q_output` / `batch_id_per_token_output` are updated in place with `copy_` in cudagraph mode, which keeps tensor addresses stable.
3. Add a `padding_cudagraph_inputs()` method to handle buffer padding for the draft model, and slice the model output by `real_token_num` during graph replay.
4. Adapt the target model's speculative decoding warmup flow (capture size calculation, passing the `accept_all_drafts` parameter).
5. Replace `padding_sampling_params` (a Python-side CPU implementation) with the `build_sampling_params` XPU custom operator, which updates `infer_seed` in place inside the operator and avoids extra operations outside the cudagraph.
6. Tie `increment_value` to the number of speculative tokens (`(num_speculative_tokens + 1) * 4`).

## Modifications
- **`custom_ops/xpu_ops/src/ops/mtp/build_sampling_params.cc`**: adds the Paddle custom operator entry point and registers the `build_sampling_params` op.
- **`custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h`**: declares the `build_sampling_params` C interface.
- **`custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/build_sampling_params.xpu`**: adds the XPU3 kernel implementation.
- **`custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/build_sampling_params.cpp`**: adds the CPU wrapper and the XPU3 wrapper.
- **`custom_ops/xpu_ops/test/test_build_sampling_params.py`**: adds unit tests covering pure decoder, pure encoder, mixed, single-request, and seed wrap-around scenarios.
- **`fastdeploy/model_executor/layers/sample/sampler.py`**: `forward_xpu` now uses the `build_sampling_params` XPU operator instead of `padding_sampling_params`; adds the `increment_value` parameter.
- **`fastdeploy/model_executor/xpu_pre_and_post_process.py`**: in cudagraph mode, `cu_seqlens_q_output` and `batch_id_per_token_output` are updated in place with `copy_` so that the tensor addresses captured by the graph stay stable.
- **`fastdeploy/spec_decode/mtp_xpu.py`**: enables `step_use_cudagraph` gating for the draft model; `_propose` gains cudagraph padding logic and output slicing; `_initialize_forward_meta` passes the cudagraph parameters.
- **`fastdeploy/worker/xpu_model_runner.py`**: ties `increment_value` to the number of speculative tokens; adapts the warmup capture flow to speculative decoding; moves the `infer_seed` update into the `build_sampling_params` operator; passes `step_use_cudagraph` to the draft model propose call.
- **`tests/xpu_ci/4cards_cases/run_mtp_cudagraph.py → test_mtp_cudagraph.py`**: renames the test script to match the CI naming convention.

## Usage or Command
N/A

## Accuracy Tests
- MTP with CUDAGraph: output matches the reference result (see the PR screenshot)
- MTP without CUDAGraph: output matches the reference result (see the PR screenshot)

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

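Overall assessment

The implementation approach is clear, the CUDAGraph capture logic (tensor address stability, first-step capture, output slicing) is well designed, and the unit tests cover the scenarios thoroughly. Four historical suggestions remain unfixed (F1/F3 missing substep parameter in the signature, F4/F8 integer-division truncation, F5 O(bs²) performance, F7 commented-out code); resolving them before this merge is recommended.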

Comment thread fastdeploy/spec_decode/mtp_xpu.py
Comment thread fastdeploy/worker/xpu_model_runner.py
GM2LM(seq_lens_encoder + bi, &lm_sle, sizeof(int)); // sync barrier

bool is_decoder = (lm_sle == 0);
int repeat = is_decoder ? lm_slt : 1;


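🟡 Suggestion: the pad_start computation has an O(bs²) performance issue (historical F5 still present)

In the current implementation, core 0 of every cluster sequentially scans the first bi batch entries to compute pad_start, giving O(bs²) overall complexity. This adds noticeable latency when bs is large.

Suggested fix: use a prefix sum and compute all pad_start[] values once before entering the batch loop, with O(bs) time complexity. Following the CPU wrapper implementation, precompute a pad_offsets[] array on core 0 and let each cluster read its own offset directly.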
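The difference between the two approaches can be sketched in host-side Python. This is not the XPU kernel: the function names are hypothetical, and the per-row repeat rule is taken from the kernel snippet quoted above (decoder rows, whose encoder length is 0, repeat seq_lens_this_time times; other rows repeat once).

```python
def repeats(seq_lens_encoder, seq_lens_this_time):
    """Per-row repeat count, mirroring `repeat = is_decoder ? lm_slt : 1`."""
    return [slt if sle == 0 else 1
            for sle, slt in zip(seq_lens_encoder, seq_lens_this_time)]


def pad_start_quadratic(reps):
    """Each row rescans all earlier rows: O(bs^2) work in total."""
    return [sum(reps[:bi]) for bi in range(len(reps))]


def pad_start_prefix_sum(reps):
    """Exclusive prefix sum computed in one pass: O(bs) work in total."""
    starts, running = [], 0
    for r in reps:
        starts.append(running)
        running += r
    return starts


seq_lens_encoder = [0, 5, 0, 0]    # hypothetical batch: row 1 is a prefill row
seq_lens_this_time = [3, 5, 2, 1]
reps = repeats(seq_lens_encoder, seq_lens_this_time)

slow = pad_start_quadratic(reps)
fast = pad_start_prefix_sum(reps)
```

Both functions return the same offsets; only the amount of work differs, which is why a precomputed pad_offsets[] array can replace the per-row scan without changing results.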

@@ -298,3 +329,16 @@ def _update_status(self):
self.target_model_inputs["seq_lens_encoder"],


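🟡 Suggestion: a large amount of commented-out code remains and should be cleaned up (historical F7 still present)

The _propose method contains many commented-out code blocks (about 15 lines), including logic for is_blocking, token_num_cpu, and others, which reduces readability.

Suggestion: once these blocks are confirmed to be no longer needed, delete them entirely. If they are still under discussion, note the corresponding issue number or expected fix date in the comment.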


Labels

contributor External developers


4 participants