
Add and verify support for deterministic fp8 dpa/mha on SM100 #2621

Merged
sudhakarsingh27 merged 18 commits into NVIDIA:main from sudhakarsingh27:fp8_determinism_sm100 on Feb 24, 2026

Conversation

@sudhakarsingh27
Collaborator

Description

Follow-up to #2584 to add and verify support for "deterministic" FP8 DPA/MHA cuDNN attention kernels.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Plumb the deterministic argument through fused_attn_fp8.cu
  • Adjust filters in pytorch/attention/dot_product_attention/utils.py to allow FP8 + deterministic kernels on SM100
  • Edit tests in test_attention.py to check FP8 with deterministic=True
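
The filter change described above can be sketched as a small predicate. This is a hypothetical simplification (the real function in utils.py carries many more checks, and the function name here is invented); the SM and cuDNN thresholds are taken from the review discussion below:

```python
def allow_fp8_deterministic_fused_attn(is_training, device_compute_capability, cudnn_version):
    """Return True if FP8 fused attention may stay enabled in deterministic mode.

    Hypothetical sketch of the filter this PR adjusts: deterministic FP8
    training is only kept on sm100+ with cuDNN >= 9.19.0; inference is
    unaffected because the determinism concern is about the backward pass.
    """
    if not is_training:
        return True
    return device_compute_capability >= (10, 0) and cudnn_version >= (9, 19, 0)

# Example: Blackwell (sm100) with cuDNN 9.19.0 keeps the backend enabled.
print(allow_fp8_deterministic_fused_attn(True, (10, 0), (9, 19, 0)))  # True
```

Using version tuples keeps the comparisons readable and matches how the Python-side filter already expresses thresholds like `(10, 0)` and `(9, 19, 0)`.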

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 self-assigned this Jan 24, 2026
@sudhakarsingh27
Collaborator Author

/te-ci pytorch L1

@greptile-apps
Contributor

greptile-apps bot commented Jan 24, 2026

Greptile Summary

This PR enables deterministic FP8 fused attention on SM90 (H100) and SM100 (Blackwell) GPUs with cuDNN 9.19.0+. The implementation correctly plumbs the deterministic parameter through the C++ backend and updates the Python filter logic to allow this configuration. Previous review feedback has been addressed: the filter now properly sets use_fused_attention = False and fused_attention_backend = None when disabling the backend, and includes the required cuDNN version check alongside the architecture requirement.

Key changes:

  • C++ backend now passes deterministic flag and calls set_deterministic_algorithm() when cuDNN >= 9.19.0
  • Python filter allows FP8 deterministic on SM90+ with cuDNN 9.19.0+
  • Tests no longer set NVTE_ALLOW_NONDETERMINISTIC_ALGO=1 to verify determinism works

The implementation is consistent with the existing FP8 current scaling filter pattern (line 504) which also requires specific cuDNN versions for determinism.
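
What "verify determinism works" means in the updated tests can be illustrated with a framework-agnostic pattern (a toy sketch, not TE test code): run the same reduction twice under identical conditions and require bitwise-equal results, which floating-point code only guarantees when the accumulation order is fixed.

```python
import random

# Toy stand-in for a non-associative reduction: floating-point summation
# order changes the result, so fixing the order (what a "deterministic"
# kernel setting does) is what makes two runs bitwise identical.
def ordered_sum(values, seed):
    rng = random.Random(seed)  # fixed seed -> fixed accumulation order
    shuffled = list(values)
    rng.shuffle(shuffled)
    total = 0.0
    for v in shuffled:
        total += v
    return total

vals = [0.1, 1e16, -1e16, 0.2]
run1 = ordered_sum(vals, seed=42)
run2 = ordered_sum(vals, seed=42)
assert run1 == run2  # deterministic: identical order -> identical bits
```

The real tests apply the same idea to cuDNN attention gradients: with the deterministic path enabled, repeated backward passes must produce identical tensors.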

Confidence Score: 5/5

  • Safe to merge - well-structured implementation with proper version guards and no functional issues
  • Implementation correctly addresses previous review feedback, follows existing patterns in the codebase, includes proper cuDNN version and architecture checks, and successfully removes test workarounds to verify determinism
  • No files require special attention

Important Files Changed

  • transformer_engine/common/fused_attn/fused_attn_fp8.cu: added a deterministic parameter to the backward pass, plumbed it through the function signature, and set it on the cuDNN graph when version >= 9.19.0
  • transformer_engine/pytorch/attention/dot_product_attention/utils.py: updated the filter to allow FP8 deterministic mode on SM90+ with cuDNN 9.19.0+; previously all FP8 deterministic training was blocked
  • tests/pytorch/attention/test_attention.py: removed NVTE_ALLOW_NONDETERMINISTIC_ALGO=1 from FP8 tests to verify deterministic execution works correctly

Last reviewed commit: 75cd00d


@greptile-apps greptile-apps bot left a comment


5 files reviewed, 1 comment


Comment on lines +2213 to +2215:

    if (cudnn_runtime_version >= 91900) {
      sdpa_backward_options.set_deterministic_algorithm(deterministic);
    }

logic: Version check uses 91900 (cuDNN 9.19.0), but related PR #2584 and description mention 9.18.1+ requirement. Should this be 91810 instead?

Suggested change:

    -  if (cudnn_runtime_version >= 91900) {
    +  if (cudnn_runtime_version >= 91810) {
         sdpa_backward_options.set_deterministic_algorithm(deterministic);
       }

Is there a specific reason FP8 requires cuDNN 9.19.0+ while FP16/BF16 only needs 9.18.1+?

@sudhakarsingh27
Collaborator Author

/te-ci pytorch L0

@greptile-apps greptile-apps bot left a comment

6 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

6 files reviewed, 2 comments

Comment on lines +1069 to 1078:

    if (
        fused_attention_backend == FusedAttnBackend["FP8"]
        and is_training
        and device_compute_capability < (10, 0)
    ):
        logger.debug(
            "Disabling FusedAttention for determinism reasons with FP8 on arch < sm100"
        )
        use_fused_attention = False
        fused_attention_backend = None

Missing cuDNN version guard for FP8 determinism on SM100

The new check allows FP8 FusedAttention with deterministic=True on SM100 (sm100+) without any cuDNN version requirement. However, in fused_attn_fp8_bwd_impl_v1, the actual set_deterministic_algorithm() call is gated behind cudnn_runtime_version >= 91900 (cuDNN 9.19.0).

For FP8 delayed scaling on SM100 with cuDNN 9.14.0–9.18.x and deterministic=True, this Python filter will allow the backend but the C++ layer will silently skip setting deterministic mode, resulting in non-deterministic execution without any warning.

The analogous filter for FP8 current scaling (line 504) already has an explicit cudnn_version < (9, 18, 0) guard. The same approach should be used here.

Suggested fix:

        if (
            fused_attention_backend == FusedAttnBackend["FP8"]
            and is_training
            and (
                device_compute_capability < (10, 0)
                or cudnn_version < (9, 19, 0)
            )
        ):
            logger.debug(
                "Disabling FusedAttention for determinism reasons with FP8 on arch < sm100"
                " or cuDNN < 9.19.0"
            )
            use_fused_attention = False
            fused_attention_backend = None

@greptile-apps
Contributor

greptile-apps bot commented Feb 17, 2026

Additional Comments (1)

tests/pytorch/attention/test_attention.py
deterministic parameter not propagated to actual model runs

The deterministic parameter is used for backend availability checks (via get_available_attention_backends) but is never passed to _run_mha_fp8_vs_f16. That helper function does not accept or apply the deterministic setting when calling mha(...), so both deterministic=True and deterministic=False parametrize cases actually execute identically, providing no additional coverage over a single run.

The same issue exists in test_dpa_fp8_vs_f16 / _run_dpa_fp8_vs_f16.

To actually test deterministic execution, _run_mha_fp8_vs_f16 and _run_dpa_fp8_vs_f16 should set os.environ["NVTE_ALLOW_NONDETERMINISTIC_ALGO"] based on the deterministic flag before running the model — which is precisely what the removed lines did unconditionally. For example:

    os.environ["NVTE_ALLOW_NONDETERMINISTIC_ALGO"] = "0" if deterministic else "1"
    flash_attn_fwd_fp8, param_names, flash_attn_bwd_fp8 = _run_mha_fp8_vs_f16(...)

Without this, the test expansion doubles the test matrix with no behavioral difference.
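
The fix this comment asks for can be sketched as a tiny helper (hypothetical; in the real tests the env var would be set inside _run_mha_fp8_vs_f16 / _run_dpa_fp8_vs_f16 before module construction, and the helper name here is invented):

```python
import os

# Hypothetical helper mirroring the suggested fix: derive the
# NVTE_ALLOW_NONDETERMINISTIC_ALGO env var from the test's
# `deterministic` parameter before building the attention modules.
def apply_determinism_flag(deterministic: bool) -> None:
    os.environ["NVTE_ALLOW_NONDETERMINISTIC_ALGO"] = "0" if deterministic else "1"

apply_determinism_flag(True)
print(os.environ["NVTE_ALLOW_NONDETERMINISTIC_ALGO"])  # "0"
```

Because the modules read this env var at construction time, the flag has to be applied before the model is built, not merely passed to the backend-availability check.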

@greptile-apps greptile-apps bot left a comment

6 files reviewed, 1 comment

@greptile-apps
Contributor

greptile-apps bot commented Feb 18, 2026

Additional Comments (1)

tests/pytorch/attention/test_attention.py
deterministic not passed to execution functions

The new deterministic test parameter is used for backend availability filtering (lines 1854-1862, 1868-1873) but is never passed to _run_mha_fp8_vs_f16() (lines 1885-1896). The _run_mha_fp8_vs_f16 function doesn't accept a deterministic parameter, so DotProductAttention / MultiheadAttention modules inside will always use self.deterministic = False (from the default NVTE_ALLOW_NONDETERMINISTIC_ALGO=1).

The same issue applies to _run_dpa_fp8_vs_f16() calls in test_dpa_fp8_vs_f16 (lines 2142-2163).

As a result, when deterministic=True, the test only validates that the backend is correctly selected/rejected — it never actually exercises the deterministic code path through cuDNN. If the intent is to also verify deterministic execution, deterministic needs to be plumbed through to the run functions and set via NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 before module construction.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@greptile-apps greptile-apps bot left a comment

5 files reviewed, 1 comment

Comment on lines +1070 to 1079:

    if (
        fused_attention_backend == FusedAttnBackend["FP8"]
        and is_training
        and device_compute_capability < (10, 0)
    ):
        logger.debug(
            "Disabling FusedAttention for determinism reasons with FP8 on arch < sm100"
        )
        use_fused_attention = False
        fused_attention_backend = None

Missing cuDNN version guard - Python allows FP8 deterministic on SM100 without checking cuDNN version, but C++ only sets deterministic mode when cudnn_runtime_version >= 91900 (fused_attn_fp8.cu:2220). With cuDNN 9.14.0-9.18.x, execution will be non-deterministic despite deterministic=True.

Add version check like FP8 current scaling (line 504):

Suggested change:

    if (
        fused_attention_backend == FusedAttnBackend["FP8"]
        and is_training
        and (
            device_compute_capability < (10, 0)
            or cudnn_version < (9, 19, 0)
        )
    ):
        logger.debug(
            "Disabling FusedAttention for determinism reasons with FP8 on arch < sm100 "
            "or cuDNN < 9.19.0"
        )
        use_fused_attention = False
        fused_attention_backend = None

@sudhakarsingh27
Collaborator Author

/te-ci L0

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 1 comment

…ls.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27
Collaborator Author

/te-ci pytorch L0

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 1 comment

@sudhakarsingh27
Collaborator Author

/te-ci jax L0

Actually switch off fused-attention backend

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@sudhakarsingh27
Collaborator Author

/te-ci L0

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27
Collaborator Author

/te-ci L0

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@sudhakarsingh27 sudhakarsingh27 merged commit e8f7c5a into NVIDIA:main Feb 24, 2026
36 of 42 checks passed
KshitijLakhani pushed a commit that referenced this pull request Feb 25, 2026
* add fp8 determinism support

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update cudnn fe to 1.18

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* enable determinism for sm90

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/pytorch/attention/dot_product_attention/utils.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Apply suggestion from @greptile-apps[bot]

Actually switch off fused-attention backend

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* remove extraneous `deterministic` test input arg

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Oleg-Goncharov pushed a commit to Oleg-Goncharov/TransformerEngine that referenced this pull request Feb 27, 2026
…IA#2621)

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>