
[Feature][OP] Add V100 (SM70) GPU Support #6306

Open

mattheliu wants to merge 12 commits into PaddlePaddle:develop from mattheliu:fastdeploy_v100

Conversation


@mattheliu mattheliu commented Feb 2, 2026

Motivation

Add support for NVIDIA V100 GPUs (SM70 architecture) to FastDeploy so that development and testing can run on older GPUs. Because the V100 lacks the following features, both the build system and the runtime logic need to be adapted:

  • BF16 data type: requires SM80+ (Ampere)
  • FP8 quantization: requires SM89+ (Ada Lovelace)
  • cp.async instruction: requires SM80+ (Ampere); affects Append Attention and MLA Attention
  • Marlin GEMM: requires SM80+ (Ampere)

Modifications

Build System

  • setup_ops.py: support compiling for SM70+; split out SM70-specific and SM80+-specific sources
  • cpp_extensions.cc: add ENABLE_APPEND_ATTENTION and ENABLE_BF16 macros for conditional compilation (see the sketch below)
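To illustrate the build-side gating, here is a minimal sketch, not the PR's actual code: the helper names (select_sources, nvcc_flags) are hypothetical, and wiring the ENABLE_* macros in as -D flags is an assumption based on the description above.

```python
# Hypothetical sketch of SM-gated source selection for a setup_ops.py-style
# build script; function and variable names are illustrative only.
def select_sources(cc: int, common_sources: list, sm80_only_sources: list) -> list:
    """Return the source list for a target compute capability (e.g. 70, 80)."""
    sources = list(common_sources)
    if cc >= 80:
        # cp.async-based attention kernels, Marlin GEMM, BF16 paths, etc.
        sources += sm80_only_sources
    return sources

def nvcc_flags(cc: int) -> list:
    """nvcc gencode flags plus macros guarding SM80+-only bindings (assumed)."""
    flags = [f"-gencode=arch=compute_{cc},code=sm_{cc}"]
    if cc >= 80:
        flags += ["-DENABLE_APPEND_ATTENTION", "-DENABLE_BF16"]
    return flags
```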

CUDA Kernels

  • gelu_tanh.cu: fix the tanh.approx.f32 PTX instruction failing to compile on SM70
  • moe_wna16_marlin_*.cu/h: fix Marlin GEMM template compilation compatibility on SM70

Python Runtime Layer

  • fastdeploy/platforms/cuda.py:

    • Add an SM version detection method (get_sm_version())
    • Add hardware capability checks (supports_bf16(), supports_fp8(), supports_async_copy(), supports_marlin())
    • Automatic attention backend fallback (APPEND_ATTN/MLA_ATTN → FLASH_ATTN)
  • fastdeploy/config.py: automatic BF16 → FP16 dtype downgrade

  • fastdeploy/model_executor/layers/moe/moe.py:

    • Marlin MoE backend → CUTLASS fallback (SM < 80)
    • Triton MoE backend → CUTLASS fallback (SM < 80)
  • fastdeploy/model_executor/layers/quantization/__init__.py:

    • Automatic fallback for FP8 quantization methods (block_wise_fp8 → wint8, w4afp8 → wint4)
  • fastdeploy/model_executor/layers/quantization/mix_quant.py:

    • Automatic fallback for FP8 quant types in MixQuantConfig
  • fastdeploy/model_executor/layers/quantization/weight_only.py:

    • Marlin/Triton backend fallback in WeightOnlyConfig
  • fastdeploy/model_executor/layers/quantization/block_wise_fp8.py:

    • Guard the deep_gemm import (skipped when SM < 89)
  • attention/ops/*.py: wrap SM80+-only ops in try-except guards (see the sketch after this list)
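As referenced above, a minimal sketch of the detection/fallback pattern. It assumes paddle.device.cuda.get_device_capability() is available and a CUDA device is present; the function bodies are simplified illustrations of the description above, not the PR's actual code.

```python
import paddle

def get_sm_version() -> int:
    """Compute capability as an integer, e.g. (7, 0) on a V100 -> 70."""
    major, minor = paddle.device.cuda.get_device_capability()
    return major * 10 + minor

def supports_bf16() -> bool:
    return get_sm_version() >= 80  # BF16 needs Ampere (SM80) or newer

def supports_fp8() -> bool:
    return get_sm_version() >= 89  # FP8 needs Ada Lovelace (SM89) or newer

def resolve_attention_backend(requested: str) -> str:
    """Fall back to FLASH_ATTN when cp.async-based backends are unsupported."""
    if requested in ("APPEND_ATTN", "MLA_ATTN") and get_sm_version() < 80:
        return "FLASH_ATTN"
    return requested

# Import guard in the spirit of the block_wise_fp8.py change: skip
# deep_gemm when the extension is unavailable (e.g. on SM < 89).
try:
    import deep_gemm  # noqa: F401
except ImportError:
    deep_gemm = None
```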

Tests

  • tests/layers/test_attention_layer.py: add an FP8 SM89+ skip decorator (see the decorator sketch after this list)
  • tests/layers/test_fusedmoe.py: add an FP8 SM89+ skip decorator
  • tests/quantization/test_w4afp8.py: add an FP8 SM89+ skip decorator
  • tests/layers/test_ffn.py: select the dtype and quantization config automatically based on the SM version
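The skip decorators mentioned above might look like the following sketch, assuming a CUDA-capable test environment; the test name is hypothetical and the helper mirrors the capability sketch earlier.

```python
import paddle
import pytest

def get_sm_version() -> int:
    major, minor = paddle.device.cuda.get_device_capability()
    return major * 10 + minor

# Skips a test on pre-Ada GPUs such as the V100 (SM70).
requires_sm89 = pytest.mark.skipif(
    get_sm_version() < 89,
    reason="FP8 ops require SM89+",
)

@requires_sm89
def test_w4afp8_gemm():  # hypothetical FP8 test
    ...
```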

SM70 Fallback Strategy Overview

| Feature | Original | SM70 Fallback | Reason |
| --- | --- | --- | --- |
| Data type | BF16 | FP16 | BF16 requires SM80+ |
| Attention backend | APPEND_ATTN | FLASH_ATTN | cp.async requires SM80+ |
| Attention backend | MLA_ATTN | FLASH_ATTN | cp.async requires SM80+ |
| MoE backend | Marlin | CUTLASS | Marlin requires SM80+ |
| MoE backend | Triton | CUTLASS | tritonmoe_preprocess requires SM80+ |
| Quantization | block_wise_fp8 | wint8 | FP8 requires SM89+ |
| Quantization | w4afp8 | wint4 | FP8 requires SM89+ |
| Quantization | wfp8afp8 | wint8 | FP8 requires SM89+ |
| Quantization | tensor_wise_fp8 | wint8 | FP8 requires SM89+ |
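The quantization rows of this table reduce to a name remapping; a hedged sketch of applying such a fallback map follows (the dict and function are illustrative, not the PR's actual code).

```python
# Illustrative FP8 -> integer-quantization fallback map for SM < 89,
# mirroring the table above; names here are for demonstration only.
FP8_FALLBACKS = {
    "block_wise_fp8": "wint8",
    "w4afp8": "wint4",
    "wfp8afp8": "wint8",
    "tensor_wise_fp8": "wint8",
}

def resolve_quant_method(name: str, sm: int) -> str:
    """Downgrade FP8 quantization methods on pre-SM89 GPUs."""
    if sm < 89 and name in FP8_FALLBACKS:
        return FP8_FALLBACKS[name]
    return name

assert resolve_quant_method("w4afp8", sm=70) == "wint4"   # V100 fallback
assert resolve_quant_method("w4afp8", sm=90) == "w4afp8"  # unchanged on SM89+
```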

Usage or Command

# Build (targeting the SM70 architecture)
cd custom_ops && bash build.sh 1 python false [70]

# Or use setup_ops.py
cd custom_ops && python setup_ops.py install

# Run the tests
bash test_sm70_compat.sh
# Or run them individually
pytest tests/platforms/test_platforms.py -v
pytest tests/layers/test_attention_layer.py -v
pytest tests/layers/test_ffn.py -v
pytest tests/layers/test_fusedmoe.py -v
pytest tests/quantization/test_w4afp8.py -v
pytest tests/quantization/ -v

Accuracy Tests

Test results on a V100 (SM70):

=== 1. Platform Detection ===
current sm_version=70
Platform: CUDAPlatform
Is V100 (SM70): True

=== 2. Platform Tests ===
29 passed, 0 failed

=== 3. Attention Tests ===
5 passed, 1 skipped (FP8 quantization requires SM89+)

=== 4. FFN Tests ===
1 passed, 0 failed

=== 5. MoE Tests ===
1 skipped (FP8 quantization requires SM89+)

=== 6. W4AFP8 Quantization Tests ===
6 passed, 5 skipped (FP8 ops require SM89+)

=== 7. All Quantization Tests ===
46 passed, 9 skipped

=== 8. Non-FP8 Quantization Tests ===
36 passed, 1 skipped (XPU)

Total: 123+ passed, 17 skipped, 0 failed

All FP8-related tests are skipped correctly on the V100 (reported as SKIPPED (FP8 ops require SM89+)); all non-FP8 functionality passes.

Checklist

  • Add at least one tag in the PR title.
  • Format your code; run pre-commit before committing.
  • Add unit tests, or state the reason in this PR if none are added.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Feb 2, 2026

Thanks for your contribution!

Support FP16 inference on V100 by adding SM70 compilation flags, disabling BF16/FP8 quantization, and adding graceful fallbacks for SM80+-only ops.
Resolve merge conflicts:
- custom_ops/gpu_ops/cpp_extensions.cc: Keep both MaskedPerTokenQuant and FusedMaskSwigluFP8Quant
- fused_moe_deepgemm_backend.py: Keep _fp8_quant_blockwise_compat for V100 compatibility
- block_wise_fp8.py: Keep SM70/V100 compatibility checks for FP8 support

Co-Authored-By: Claude (Claude Opus 4.5) <[email protected]>
The per_token_quant_fp8.cu file was removed and its functionality
moved to quantization/common.cu. Remove the stale reference from
setup_ops.py to fix the CI build failure.

Co-Authored-By: Claude (Claude Opus 4.5) <[email protected]>
…gluFP8Quant

- Remove PerTokenQuantPadding and MaskedPerTokenQuant declarations which have no implementation
- Restore fused_mask_swiglu_fp8_quant pybind registration (was incorrectly changed to masked_per_token_quant)
- Fix parameter name from recv_expert_count to token_nums_per_expert to match upstream

This fixes the CI build error: undefined symbol _Z19MaskedPerTokenQuantRN6paddle6TensorES1_ib

Co-Authored-By: Claude (Claude Opus 4.5) <[email protected]>
The FusedMaskSwigluFP8Quant function is declared in cpp_extensions.cc, but its
implementation file was missing from the build sources list, causing an
undefined-symbol error during linking.

This fixes CI build error: undefined symbol _Z23FusedMaskSwigluFP8QuantRN6paddle6TensorES1_ib

Co-Authored-By: Claude (Claude Opus 4.5) <[email protected]>
The SetStop function is declared in metax_ops/cpp_extensions.cc, but its
implementation file (gpu_ops/set_stop.cu) was missing from the MetaX build
sources list, causing an undefined-symbol error.

This fixes MetaX CI build error: undefined symbol _Z7SetStopRN6paddle6TensorEb

Co-Authored-By: Claude (Claude Opus 4.5) <[email protected]>
@mattheliu
Collaborator Author

/Re-run failed jobs

@mattheliu mattheliu marked this pull request as ready for review February 4, 2026 05:47
@codecov-commenter

Codecov Report

❌ Patch coverage is 43.21608% with 113 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@9b0a82c).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...loy/model_executor/layers/quantization/__init__.py | 28.57% | 23 Missing and 2 partials ⚠️ |
| ...oy/model_executor/layers/quantization/mix_quant.py | 39.28% | 15 Missing and 2 partials ⚠️ |
| fastdeploy/platforms/cuda.py | 68.29% | 11 Missing and 2 partials ⚠️ |
| .../model_executor/layers/quantization/weight_only.py | 21.42% | 7 Missing and 4 partials ⚠️ |
| ..._executor/layers/moe/fused_moe_deepgemm_backend.py | 11.11% | 8 Missing ⚠️ |
| ...del_executor/layers/quantization/block_wise_fp8.py | 42.85% | 6 Missing and 2 partials ⚠️ |
| fastdeploy/model_executor/layers/moe/moe.py | 22.22% | 4 Missing and 3 partials ⚠️ |
| ...l_executor/layers/moe/fused_moe_cutlass_backend.py | 40.00% | 6 Missing ⚠️ |
| fastdeploy/config.py | 62.50% | 1 Missing and 2 partials ⚠️ |
| ...cutor/layers/attention/ops/flash_mask_attention.py | 0.00% | 3 Missing ⚠️ |

... and 5 more
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6306   +/-   ##
==========================================
  Coverage           ?   67.53%           
==========================================
  Files              ?      389           
  Lines              ?    52098           
  Branches           ?     8116           
==========================================
  Hits               ?    35184           
  Misses             ?    14344           
  Partials           ?     2570           
| Flag | Coverage Δ |
| --- | --- |
| GPU | 67.53% <43.21%> (?) |


The MetaX platform needs the gelu_tanh op for model inference.
It was accidentally removed in the SM70 compatibility changes.

Co-Authored-By: Claude (Claude Opus 4.5) <[email protected]>
Resolved conflicts:
- fused_moe_deepgemm_backend.py: Keep upstream FD_USE_PHI_FP8_QUANT logic,
  use _fp8_quant_blockwise_compat for phi branch compatibility
- block_wise_fp8.py: Same approach - upstream logic with compat wrapper

Co-Authored-By: Claude (Claude Opus 4.5) <[email protected]>