Add and verify support for deterministic fp8 dpa/mha on SM100 #2621

sudhakarsingh27 merged 18 commits into NVIDIA:main from fp8_determinism_sm100

Conversation
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
for more information, see https://pre-commit.ci
/te-ci pytorch L1
Greptile Summary

This PR enables deterministic FP8 fused attention on SM90 (H100) and SM100 (Blackwell) GPUs with cuDNN 9.19.0+. The implementation plumbs the `deterministic` argument through to the FP8 attention kernels.

Key changes:

* Pass the `deterministic` argument through `fused_attn_fp8.cu`.
* Update `pytorch/attention/dot_product_attention/utils.py` to allow fp8 + deterministic kernels on SM100.
* Extend `test_attention.py` to check fp8 with `deterministic=True`.

The implementation is consistent with the existing FP8 current scaling filter pattern (line 504), which also requires specific cuDNN versions for determinism.

Confidence Score: 5/5

Last reviewed commit: 75cd00d
```cpp
if (cudnn_runtime_version >= 91900) {
  sdpa_backward_options.set_deterministic_algorithm(deterministic);
}
```
logic: Version check uses 91900 (cuDNN 9.19.0), but related PR #2584 and description mention 9.18.1+ requirement. Should this be 91810 instead?
Suggested change:

```diff
- if (cudnn_runtime_version >= 91900) {
+ if (cudnn_runtime_version >= 91810) {
    sdpa_backward_options.set_deterministic_algorithm(deterministic);
  }
```
Is there a specific reason FP8 requires cuDNN 9.19.0+ while FP16/BF16 only needs 9.18.1+?
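For reference, the integer form used in the C++ check and the tuple form used by the Python filter describe the same version. Below is a minimal sketch of the correspondence, assuming the MAJOR * 10000 + MINOR * 100 + PATCH encoding of the cuDNN 9.x headers (the exact encoding of `cudnn_runtime_version` in this codebase may differ, and the helper is illustrative, not part of Transformer Engine):

```python
# Sketch: decode the integer version used in the C++ check into the tuple
# form used by the Python-side filter. Assumes cuDNN 9.x's
# MAJOR * 10000 + MINOR * 100 + PATCH scheme.
def cudnn_int_to_tuple(version: int) -> tuple[int, int, int]:
    return (version // 10000, (version % 10000) // 100, version % 100)

assert cudnn_int_to_tuple(91900) == (9, 19, 0)  # threshold used in this PR
```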
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
…ansformerEngine into fp8_determinism_sm100
/te-ci pytorch L0
```python
if (
    fused_attention_backend == FusedAttnBackend["FP8"]
    and is_training
    and device_compute_capability < (10, 0)
):
    logger.debug(
        "Disabling FusedAttention for determinism reasons with FP8 on arch < sm100"
    )
    use_fused_attention = False
    fused_attention_backend = None
```
Missing cuDNN version guard for FP8 determinism on SM100
The new check allows FP8 FusedAttention with deterministic=True on SM100 and newer without any cuDNN version requirement. However, in fused_attn_fp8_bwd_impl_v1, the actual set_deterministic_algorithm() call is gated behind cudnn_runtime_version >= 91900 (cuDNN 9.19.0).
For FP8 delayed scaling on SM100 with cuDNN 9.14.0–9.18.x and deterministic=True, this Python filter will allow the backend but the C++ layer will silently skip setting deterministic mode, resulting in non-deterministic execution without any warning.
The analogous filter for FP8 current scaling (line 504) already has an explicit cudnn_version < (9, 18, 0) guard. The same approach should be used here.
Suggested fix:
```python
if (
    fused_attention_backend == FusedAttnBackend["FP8"]
    and is_training
    and (
        device_compute_capability < (10, 0)
        or cudnn_version < (9, 19, 0)
    )
):
    logger.debug(
        "Disabling FusedAttention for determinism reasons with FP8 on arch < sm100"
        " or cuDNN < 9.19.0"
    )
    use_fused_attention = False
    fused_attention_backend = None
```
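To make the combined condition concrete, here is a small standalone sketch of the decision it encodes (the predicate name is illustrative, not TE's API):

```python
# Illustrative predicate mirroring the suggested filter: FP8 fused attention
# stays enabled for deterministic training only on SM100+ with cuDNN >= 9.19.0.
def keep_fp8_fused_attention(device_compute_capability, cudnn_version):
    return not (
        device_compute_capability < (10, 0) or cudnn_version < (9, 19, 0)
    )

assert not keep_fp8_fused_attention((9, 0), (9, 19, 0))   # SM90: filtered out
assert not keep_fp8_fused_attention((10, 0), (9, 18, 1))  # cuDNN too old: filtered out
assert keep_fp8_fused_attention((10, 0), (9, 19, 0))      # SM100 + 9.19.0: kept
```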
Additional Comments (1)

The `deterministic` parameter added to the DPA test is never actually consumed, so both parametrizations exercise the same code path. The same issue exists in the MHA variant of the test. To actually test deterministic execution, set the controlling environment variable before invoking the helper:

```python
os.environ["NVTE_ALLOW_NONDETERMINISTIC_ALGO"] = "0" if deterministic else "1"
flash_attn_fwd_fp8, param_names, flash_attn_bwd_fp8 = _run_mha_fp8_vs_f16(...)
```

Without this, the test expansion doubles the test matrix with no behavioral difference.
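As a sketch of what an actual determinism check could look like (not code from the PR; `_run_mha_fp8_vs_f16` is the helper quoted above, whose full signature is omitted here):

```python
import os

import torch

# "0" requests deterministic kernels via the env var quoted above.
os.environ["NVTE_ALLOW_NONDETERMINISTIC_ALGO"] = "0"

def assert_bitwise_repeatable(run_bwd, n_runs=3):
    """run_bwd() returns a list of gradient tensors; under deterministic
    execution, every run must produce bitwise-identical results."""
    reference = run_bwd()
    for _ in range(n_runs - 1):
        for ref, new in zip(reference, run_bwd()):
            assert torch.equal(ref, new), "non-deterministic gradient detected"

# Hypothetical usage, wrapping the helper from the review comment:
# assert_bitwise_repeatable(lambda: _run_mha_fp8_vs_f16(...)[2])
```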
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
```python
if (
    fused_attention_backend == FusedAttnBackend["FP8"]
    and is_training
    and device_compute_capability < (10, 0)
):
    logger.debug(
        "Disabling FusedAttention for determinism reasons with FP8 on arch < sm100"
    )
    use_fused_attention = False
    fused_attention_backend = None
```
Missing cuDNN version guard: Python allows FP8 deterministic on SM100 without checking the cuDNN version, but C++ only sets deterministic mode when cudnn_runtime_version >= 91900 (fused_attn_fp8.cu:2220). With cuDNN 9.14.0–9.18.x, execution will be non-deterministic despite deterministic=True.
Add version check like FP8 current scaling (line 504):
```diff
  if (
      fused_attention_backend == FusedAttnBackend["FP8"]
      and is_training
-     and device_compute_capability < (10, 0)
+     and (
+         device_compute_capability < (10, 0)
+         or cudnn_version < (9, 19, 0)
+     )
  ):
      logger.debug(
-         "Disabling FusedAttention for determinism reasons with FP8 on arch < sm100"
+         "Disabling FusedAttention for determinism reasons with FP8 on arch < sm100 "
+         "or cuDNN < 9.19.0"
      )
      use_fused_attention = False
      fused_attention_backend = None
```
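Beyond filtering the backend, another option (a sketch, not code from the PR; the helper name is hypothetical) would be to surface the silent fallback explicitly when a deterministic request cannot be honored:

```python
import warnings

def warn_if_fp8_determinism_unavailable(deterministic, cudnn_version):
    # Hypothetical helper; the names and the 9.19.0 threshold mirror this
    # review thread, not an actual Transformer Engine function.
    if deterministic and cudnn_version < (9, 19, 0):
        version_str = ".".join(map(str, cudnn_version))
        warnings.warn(
            "deterministic=True requested for FP8 fused attention, but cuDNN "
            f"{version_str} < 9.19.0 cannot guarantee a deterministic backward "
            "pass; falling back to a non-fused attention backend.",
            RuntimeWarning,
        )
```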
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
for more information, see https://pre-commit.ci
/te-ci L0
…ls.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
/te-ci pytorch L0
for more information, see https://pre-commit.ci
/te-ci jax L0
Actually switch off fused-attention backend Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
/te-ci L0
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
/te-ci L0
for more information, see https://pre-commit.ci
* add fp8 determinism support
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* update cudnn fe to 1.18
* enable determinism for sm90
* Update transformer_engine/pytorch/attention/dot_product_attention/utils.py (Co-authored-by: greptile-apps[bot])
* Apply suggestion from @greptile-apps[bot]: actually switch off fused-attention backend
* remove extraneous `deterministic` test input arg

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Description
Follow-up to #2584: add and verify support for "deterministic" fp8 dpa/mha cuDNN attention kernels.
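For users of this feature, a hypothetical end-to-end sketch follows (module names match Transformer Engine's PyTorch API, but the constructor arguments, tensor shapes, and the `fp8_dpa` recipe flag shown here are illustrative; consult the TE docs for the exact interface):

```python
import os

import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

deterministic = True
# Env var discussed in the review: "0" requests deterministic kernels.
os.environ["NVTE_ALLOW_NONDETERMINISTIC_ALGO"] = "0" if deterministic else "1"

# Illustrative shapes: (seq, batch, heads, head_dim) in the default sbhd layout.
dpa = te.DotProductAttention(num_attention_heads=16, kv_channels=64)
q = torch.randn(128, 2, 16, 64, dtype=torch.bfloat16, device="cuda", requires_grad=True)
k, v = torch.randn_like(q), torch.randn_like(q)

# fp8_dpa=True asks for FP8 attention; on SM100 with cuDNN >= 9.19.0 the
# backward pass can now run deterministically per this PR.
with te.fp8_autocast(enabled=True, fp8_recipe=recipe.DelayedScaling(fp8_dpa=True)):
    out = dpa(q, k, v)
out.sum().backward()
```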
Type of change
Changes
Please list the changes introduced in this PR:
* Plumb the deterministic argument through fused_attn_fp8.cu.
* Update pytorch/attention/dot_product_attention/utils.py to allow fp8 + deterministic kernels on SM100.
* Extend test_attention.py to check fp8 with deterministic=True.

Checklist: