
[Feature][OP] Add V100 (SM70) GPU Support #6306

Open

mattheliu wants to merge 12 commits into PaddlePaddle:develop from mattheliu:fastdeploy_v100

Conversation


@mattheliu mattheliu commented Feb 2, 2026

Motivation

Add support for NVIDIA V100 GPUs (SM70 architecture) to FastDeploy so that development and testing can run on older GPUs. Because the V100 lacks the following features, both the build system and the runtime logic need to be adapted:

  • BF16 data type: requires SM80+ (Ampere)
  • FP8 quantization: requires SM89+ (Ada Lovelace)
  • cp.async instruction: requires SM80+ (Ampere); affects Append Attention and MLA Attention
  • Marlin GEMM: requires SM80+ (Ampere)

Modifications

Build System

  • setup_ops.py: support compiling for SM70+; split out SM70-specific and SM80+-specific sources
  • cpp_extensions.cc: add ENABLE_APPEND_ATTENTION and ENABLE_BF16 macros for conditional compilation (see the sketch below)
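To illustrate the build-side gating, here is a minimal sketch, not the PR's actual code: the helper names (select_sources, nvcc_flags) are hypothetical, and wiring the ENABLE_* macros in as -D flags is an assumption based on the description above.

```python
# Hypothetical sketch of SM-gated source selection for a setup_ops.py-style
# build script; function and variable names are illustrative only.
def select_sources(cc: int, common_sources: list, sm80_only_sources: list) -> list:
    """Return the source list for a target compute capability (e.g. 70, 80)."""
    sources = list(common_sources)
    if cc >= 80:
        # cp.async-based attention kernels, Marlin GEMM, BF16 paths, etc.
        sources += sm80_only_sources
    return sources

def nvcc_flags(cc: int) -> list:
    """nvcc gencode flags plus macros guarding SM80+-only bindings (assumed)."""
    flags = [f"-gencode=arch=compute_{cc},code=sm_{cc}"]
    if cc >= 80:
        flags += ["-DENABLE_APPEND_ATTENTION", "-DENABLE_BF16"]
    return flags
```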

CUDA Kernels

  • gelu_tanh.cu: fix the tanh.approx.f32 PTX instruction failing to compile on SM70
  • moe_wna16_marlin_*.cu/h: fix Marlin GEMM template compilation compatibility on SM70

Python Runtime Layer

  • fastdeploy/platforms/cuda.py:

    • Add an SM version detection method (get_sm_version())
    • Add hardware capability checks (supports_bf16(), supports_fp8(), supports_async_copy(), supports_marlin())
    • Automatic attention backend fallback (APPEND_ATTN/MLA_ATTN → FLASH_ATTN)
  • fastdeploy/config.py: automatic BF16 → FP16 dtype downgrade

  • fastdeploy/model_executor/layers/moe/moe.py:

    • Marlin MoE backend → CUTLASS fallback (SM < 80)
    • Triton MoE backend → CUTLASS fallback (SM < 80)
  • fastdeploy/model_executor/layers/quantization/__init__.py:

    • Automatic fallback for FP8 quantization methods (block_wise_fp8 → wint8, w4afp8 → wint4)
  • fastdeploy/model_executor/layers/quantization/mix_quant.py:

    • Automatic fallback for FP8 quant types in MixQuantConfig
  • fastdeploy/model_executor/layers/quantization/weight_only.py:

    • Marlin/Triton backend fallback in WeightOnlyConfig
  • fastdeploy/model_executor/layers/quantization/block_wise_fp8.py:

    • Guard the deep_gemm import (skipped when SM < 89)
  • attention/ops/*.py: wrap SM80+-only ops in try-except guards (see the sketch after this list)
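As referenced above, a minimal sketch of the detection/fallback pattern. It assumes paddle.device.cuda.get_device_capability() is available and a CUDA device is present; the function bodies are simplified illustrations of the description above, not the PR's actual code.

```python
import paddle

def get_sm_version() -> int:
    """Compute capability as an integer, e.g. (7, 0) on a V100 -> 70."""
    major, minor = paddle.device.cuda.get_device_capability()
    return major * 10 + minor

def supports_bf16() -> bool:
    return get_sm_version() >= 80  # BF16 needs Ampere (SM80) or newer

def supports_fp8() -> bool:
    return get_sm_version() >= 89  # FP8 needs Ada Lovelace (SM89) or newer

def resolve_attention_backend(requested: str) -> str:
    """Fall back to FLASH_ATTN when cp.async-based backends are unsupported."""
    if requested in ("APPEND_ATTN", "MLA_ATTN") and get_sm_version() < 80:
        return "FLASH_ATTN"
    return requested

# Import guard in the spirit of the block_wise_fp8.py change: skip
# deep_gemm when the extension is unavailable (e.g. on SM < 89).
try:
    import deep_gemm  # noqa: F401
except ImportError:
    deep_gemm = None
```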

Tests

  • tests/layers/test_attention_layer.py: add an FP8 SM89+ skip decorator (see the decorator sketch after this list)
  • tests/layers/test_fusedmoe.py: add an FP8 SM89+ skip decorator
  • tests/quantization/test_w4afp8.py: add an FP8 SM89+ skip decorator
  • tests/layers/test_ffn.py: select the dtype and quantization config automatically based on the SM version
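The skip decorators mentioned above might look like the following sketch, assuming a CUDA-capable test environment; the test name is hypothetical and the helper mirrors the capability sketch earlier.

```python
import paddle
import pytest

def get_sm_version() -> int:
    major, minor = paddle.device.cuda.get_device_capability()
    return major * 10 + minor

# Skips a test on pre-Ada GPUs such as the V100 (SM70).
requires_sm89 = pytest.mark.skipif(
    get_sm_version() < 89,
    reason="FP8 ops require SM89+",
)

@requires_sm89
def test_w4afp8_gemm():  # hypothetical FP8 test
    ...
```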

SM70 Fallback Strategy Overview

| Feature | Original | SM70 Fallback | Reason |
| --- | --- | --- | --- |
| Data type | BF16 | FP16 | BF16 requires SM80+ |
| Attention backend | APPEND_ATTN | FLASH_ATTN | cp.async requires SM80+ |
| Attention backend | MLA_ATTN | FLASH_ATTN | cp.async requires SM80+ |
| MoE backend | Marlin | CUTLASS | Marlin requires SM80+ |
| MoE backend | Triton | CUTLASS | tritonmoe_preprocess requires SM80+ |
| Quantization | block_wise_fp8 | wint8 | FP8 requires SM89+ |
| Quantization | w4afp8 | wint4 | FP8 requires SM89+ |
| Quantization | wfp8afp8 | wint8 | FP8 requires SM89+ |
| Quantization | tensor_wise_fp8 | wint8 | FP8 requires SM89+ |
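The quantization rows of this table reduce to a name remapping; a hedged sketch of applying such a fallback map follows (the dict and function are illustrative, not the PR's actual code).

```python
# Illustrative FP8 -> integer-quantization fallback map for SM < 89,
# mirroring the table above; names here are for demonstration only.
FP8_FALLBACKS = {
    "block_wise_fp8": "wint8",
    "w4afp8": "wint4",
    "wfp8afp8": "wint8",
    "tensor_wise_fp8": "wint8",
}

def resolve_quant_method(name: str, sm: int) -> str:
    """Downgrade FP8 quantization methods on pre-SM89 GPUs."""
    if sm < 89 and name in FP8_FALLBACKS:
        return FP8_FALLBACKS[name]
    return name

assert resolve_quant_method("w4afp8", sm=70) == "wint4"   # V100 fallback
assert resolve_quant_method("w4afp8", sm=90) == "w4afp8"  # unchanged on SM89+
```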

Usage or Command

# Build (targeting the SM70 architecture)
cd custom_ops && bash build.sh 1 python false [70]

# Or use setup_ops.py
cd custom_ops && python setup_ops.py install

# Run the tests
bash test_sm70_compat.sh
# Or run them individually
pytest tests/platforms/test_platforms.py -v
pytest tests/layers/test_attention_layer.py -v
pytest tests/layers/test_ffn.py -v
pytest tests/layers/test_fusedmoe.py -v
pytest tests/quantization/test_w4afp8.py -v
pytest tests/quantization/ -v

Accuracy Tests

Test results on a V100 (SM70):

=== 1. Platform Detection ===
current sm_version=70
Platform: CUDAPlatform
Is V100 (SM70): True

=== 2. Platform Tests ===
29 passed, 0 failed

=== 3. Attention Tests ===
5 passed, 1 skipped (FP8 quantization requires SM89+)

=== 4. FFN Tests ===
1 passed, 0 failed

=== 5. MoE Tests ===
1 skipped (FP8 quantization requires SM89+)

=== 6. W4AFP8 Quantization Tests ===
6 passed, 5 skipped (FP8 ops require SM89+)

=== 7. All Quantization Tests ===
46 passed, 9 skipped

=== 8. Non-FP8 Quantization Tests ===
36 passed, 1 skipped (XPU)

Total: 123+ passed, 17 skipped, 0 failed

All FP8-related tests are skipped correctly on the V100 (reported as SKIPPED (FP8 ops require SM89+)); all non-FP8 functionality passes.

Checklist

  • Add at least one tag in the PR title.
  • Format your code; run pre-commit before committing.
  • Add unit tests, or state the reason in this PR if none are added.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Feb 2, 2026

Thanks for your contribution!

Support FP16 inference on V100 by adding SM70 compilation flags, disabling BF16/FP8 quantization, and adding graceful fallbacks for SM80+-only ops.
Resolve merge conflicts:
- custom_ops/gpu_ops/cpp_extensions.cc: Keep both MaskedPerTokenQuant and FusedMaskSwigluFP8Quant
- fused_moe_deepgemm_backend.py: Keep _fp8_quant_blockwise_compat for V100 compatibility
- block_wise_fp8.py: Keep SM70/V100 compatibility checks for FP8 support

Co-Authored-By: Claude (Claude Opus 4.5) <[email protected]>
The per_token_quant_fp8.cu file was removed and its functionality
moved to quantization/common.cu. Remove the stale reference from
setup_ops.py to fix the CI build failure.

Co-Authored-By: Claude (Claude Opus 4.5) <[email protected]>
…gluFP8Quant

- Remove PerTokenQuantPadding and MaskedPerTokenQuant declarations which have no implementation
- Restore fused_mask_swiglu_fp8_quant pybind registration (was incorrectly changed to masked_per_token_quant)
- Fix parameter name from recv_expert_count to token_nums_per_expert to match upstream

This fixes the CI build error: undefined symbol _Z19MaskedPerTokenQuantRN6paddle6TensorES1_ib

Co-Authored-By: Claude (Claude Opus 4.5) <[email protected]>
The FusedMaskSwigluFP8Quant function is declared in cpp_extensions.cc, but its
implementation file was missing from the build sources list, causing an
undefined-symbol error during linking.

This fixes CI build error: undefined symbol _Z23FusedMaskSwigluFP8QuantRN6paddle6TensorES1_ib

Co-Authored-By: Claude (Claude Opus 4.5) <[email protected]>
The SetStop function is declared in metax_ops/cpp_extensions.cc, but its
implementation file (gpu_ops/set_stop.cu) was missing from the MetaX build
sources list, causing an undefined-symbol error.

This fixes MetaX CI build error: undefined symbol _Z7SetStopRN6paddle6TensorEb

Co-Authored-By: Claude (Claude Opus 4.5) <[email protected]>
@mattheliu
Collaborator Author

/Re-run failed jobs

@mattheliu mattheliu marked this pull request as ready for review February 4, 2026 05:47
@codecov-commenter

Codecov Report

❌ Patch coverage is 43.21608% with 113 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@9b0a82c).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...loy/model_executor/layers/quantization/__init__.py | 28.57% | 23 Missing and 2 partials ⚠️ |
| ...oy/model_executor/layers/quantization/mix_quant.py | 39.28% | 15 Missing and 2 partials ⚠️ |
| fastdeploy/platforms/cuda.py | 68.29% | 11 Missing and 2 partials ⚠️ |
| .../model_executor/layers/quantization/weight_only.py | 21.42% | 7 Missing and 4 partials ⚠️ |
| ..._executor/layers/moe/fused_moe_deepgemm_backend.py | 11.11% | 8 Missing ⚠️ |
| ...del_executor/layers/quantization/block_wise_fp8.py | 42.85% | 6 Missing and 2 partials ⚠️ |
| fastdeploy/model_executor/layers/moe/moe.py | 22.22% | 4 Missing and 3 partials ⚠️ |
| ...l_executor/layers/moe/fused_moe_cutlass_backend.py | 40.00% | 6 Missing ⚠️ |
| fastdeploy/config.py | 62.50% | 1 Missing and 2 partials ⚠️ |
| ...cutor/layers/attention/ops/flash_mask_attention.py | 0.00% | 3 Missing ⚠️ |

... and 5 more
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6306   +/-   ##
==========================================
  Coverage           ?   67.53%           
==========================================
  Files              ?      389           
  Lines              ?    52098           
  Branches           ?     8116           
==========================================
  Hits               ?    35184           
  Misses             ?    14344           
  Partials           ?     2570           
| Flag | Coverage Δ |
| --- | --- |
| GPU | 67.53% <43.21%> (?) |


The MetaX platform needs the gelu_tanh op for model inference.
It was accidentally removed in the SM70 compatibility changes.

Co-Authored-By: Claude (Claude Opus 4.5) <[email protected]>
Resolved conflicts:
- fused_moe_deepgemm_backend.py: Keep upstream FD_USE_PHI_FP8_QUANT logic,
  use _fp8_quant_blockwise_compat for phi branch compatibility
- block_wise_fp8.py: Same approach - upstream logic with compat wrapper

Co-Authored-By: Claude (Claude Opus 4.5) <[email protected]>