
Add and verify support for deterministic fp8 dpa/mha on SM100 #2621

Merged
sudhakarsingh27 merged 18 commits into NVIDIA:main from sudhakarsingh27:fp8_determinism_sm100 on Feb 24, 2026

Conversation

@sudhakarsingh27
Collaborator

Description

Follow-up to #2584 to add and verify support for "deterministic" FP8 DPA/MHA cuDNN attention kernels.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Plumb the deterministic argument through fused_attn_fp8.cu
  • Adjust filters in pytorch/attention/dot_product_attention/utils.py to allow FP8 + deterministic kernels on SM100
  • Edit tests in test_attention.py to check FP8 with deterministic=True
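
The filter change described above can be sketched as a small predicate. This is a hypothetical simplification (the real function in utils.py carries many more checks, and the function name here is invented); the SM and cuDNN thresholds are taken from the review discussion below:

```python
def allow_fp8_deterministic_fused_attn(is_training, device_compute_capability, cudnn_version):
    """Return True if FP8 fused attention may stay enabled in deterministic mode.

    Hypothetical sketch of the filter this PR adjusts: deterministic FP8
    training is only kept on sm100+ with cuDNN >= 9.19.0; inference is
    unaffected because the determinism concern is about the backward pass.
    """
    if not is_training:
        return True
    return device_compute_capability >= (10, 0) and cudnn_version >= (9, 19, 0)

# Example: Blackwell (sm100) with cuDNN 9.19.0 keeps the backend enabled.
print(allow_fp8_deterministic_fused_attn(True, (10, 0), (9, 19, 0)))  # True
```

Using version tuples keeps the comparisons readable and matches how the Python-side filter already expresses thresholds like `(10, 0)` and `(9, 19, 0)`.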

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 self-assigned this Jan 24, 2026
@sudhakarsingh27
Collaborator Author

/te-ci pytorch L1

@greptile-apps
Contributor

greptile-apps bot commented Jan 24, 2026

Greptile Summary

This PR enables deterministic FP8 fused attention on SM90 (H100) and SM100 (Blackwell) GPUs with cuDNN 9.19.0+. The implementation correctly plumbs the deterministic parameter through the C++ backend and updates the Python filter logic to allow this configuration. Previous review feedback has been addressed: the filter now properly sets use_fused_attention = False and fused_attention_backend = None when disabling the backend, and includes the required cuDNN version check alongside the architecture requirement.

Key changes:

  • C++ backend now passes deterministic flag and calls set_deterministic_algorithm() when cuDNN >= 9.19.0
  • Python filter allows FP8 deterministic on SM90+ with cuDNN 9.19.0+
  • Tests no longer set NVTE_ALLOW_NONDETERMINISTIC_ALGO=1 to verify determinism works

The implementation is consistent with the existing FP8 current scaling filter pattern (line 504) which also requires specific cuDNN versions for determinism.
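
What "verify determinism works" means in the updated tests can be illustrated with a framework-agnostic pattern (a toy sketch, not TE test code): run the same reduction twice under identical conditions and require bitwise-equal results, which floating-point code only guarantees when the accumulation order is fixed.

```python
import random

# Toy stand-in for a non-associative reduction: floating-point summation
# order changes the result, so fixing the order (what a "deterministic"
# kernel setting does) is what makes two runs bitwise identical.
def ordered_sum(values, seed):
    rng = random.Random(seed)  # fixed seed -> fixed accumulation order
    shuffled = list(values)
    rng.shuffle(shuffled)
    total = 0.0
    for v in shuffled:
        total += v
    return total

vals = [0.1, 1e16, -1e16, 0.2]
run1 = ordered_sum(vals, seed=42)
run2 = ordered_sum(vals, seed=42)
assert run1 == run2  # deterministic: identical order -> identical bits
```

The real tests apply the same idea to cuDNN attention gradients: with the deterministic path enabled, repeated backward passes must produce identical tensors.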

Confidence Score: 5/5

  • Safe to merge - well-structured implementation with proper version guards and no functional issues
  • Implementation correctly addresses previous review feedback, follows existing patterns in the codebase, includes proper cuDNN version and architecture checks, and successfully removes test workarounds to verify determinism
  • No files require special attention

Important Files Changed

  • transformer_engine/common/fused_attn/fused_attn_fp8.cu: added a deterministic parameter to the backward pass, plumbed it through the function signature, and set it on the cuDNN graph when version >= 9.19.0
  • transformer_engine/pytorch/attention/dot_product_attention/utils.py: updated the filter to allow FP8 deterministic mode on SM90+ with cuDNN 9.19.0+; previously all FP8 deterministic training was blocked
  • tests/pytorch/attention/test_attention.py: removed NVTE_ALLOW_NONDETERMINISTIC_ALGO=1 from FP8 tests to verify deterministic execution works correctly

Last reviewed commit: 75cd00d


@greptile-apps greptile-apps bot left a comment


5 files reviewed, 1 comment


Comment on lines +2213 to +2215:

    if (cudnn_runtime_version >= 91900) {
      sdpa_backward_options.set_deterministic_algorithm(deterministic);
    }

logic: Version check uses 91900 (cuDNN 9.19.0), but related PR #2584 and description mention 9.18.1+ requirement. Should this be 91810 instead?

Suggested change:

    -  if (cudnn_runtime_version >= 91900) {
    +  if (cudnn_runtime_version >= 91810) {
         sdpa_backward_options.set_deterministic_algorithm(deterministic);
       }

Is there a specific reason FP8 requires cuDNN 9.19.0+ while FP16/BF16 only needs 9.18.1+?

@sudhakarsingh27
Collaborator Author

/te-ci pytorch L0

@greptile-apps greptile-apps bot left a comment

6 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

6 files reviewed, 2 comments

Comment on lines +1069 to 1078:

    if (
        fused_attention_backend == FusedAttnBackend["FP8"]
        and is_training
        and device_compute_capability < (10, 0)
    ):
        logger.debug(
            "Disabling FusedAttention for determinism reasons with FP8 on arch < sm100"
        )
        use_fused_attention = False
        fused_attention_backend = None

Missing cuDNN version guard for FP8 determinism on SM100

The new check allows FP8 FusedAttention with deterministic=True on SM100 (sm100+) without any cuDNN version requirement. However, in fused_attn_fp8_bwd_impl_v1, the actual set_deterministic_algorithm() call is gated behind cudnn_runtime_version >= 91900 (cuDNN 9.19.0).

For FP8 delayed scaling on SM100 with cuDNN 9.14.0–9.18.x and deterministic=True, this Python filter will allow the backend but the C++ layer will silently skip setting deterministic mode, resulting in non-deterministic execution without any warning.

The analogous filter for FP8 current scaling (line 504) already has an explicit cudnn_version < (9, 18, 0) guard. The same approach should be used here.

Suggested fix:

        if (
            fused_attention_backend == FusedAttnBackend["FP8"]
            and is_training
            and (
                device_compute_capability < (10, 0)
                or cudnn_version < (9, 19, 0)
            )
        ):
            logger.debug(
                "Disabling FusedAttention for determinism reasons with FP8 on arch < sm100"
                " or cuDNN < 9.19.0"
            )
            use_fused_attention = False
            fused_attention_backend = None

@greptile-apps
Contributor

greptile-apps bot commented Feb 17, 2026

Additional Comments (1)

tests/pytorch/attention/test_attention.py
deterministic parameter not propagated to actual model runs

The deterministic parameter is used for backend availability checks (via get_available_attention_backends) but is never passed to _run_mha_fp8_vs_f16. That helper function does not accept or apply the deterministic setting when calling mha(...), so both deterministic=True and deterministic=False parametrize cases actually execute identically, providing no additional coverage over a single run.

The same issue exists in test_dpa_fp8_vs_f16 / _run_dpa_fp8_vs_f16.

To actually test deterministic execution, _run_mha_fp8_vs_f16 and _run_dpa_fp8_vs_f16 should set os.environ["NVTE_ALLOW_NONDETERMINISTIC_ALGO"] based on the deterministic flag before running the model — which is precisely what the removed lines did unconditionally. For example:

    os.environ["NVTE_ALLOW_NONDETERMINISTIC_ALGO"] = "0" if deterministic else "1"
    flash_attn_fwd_fp8, param_names, flash_attn_bwd_fp8 = _run_mha_fp8_vs_f16(...)

Without this, the test expansion doubles the test matrix with no behavioral difference.
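
The fix this comment asks for can be sketched as a tiny helper (hypothetical; in the real tests the env var would be set inside _run_mha_fp8_vs_f16 / _run_dpa_fp8_vs_f16 before module construction, and the helper name here is invented):

```python
import os

# Hypothetical helper mirroring the suggested fix: derive the
# NVTE_ALLOW_NONDETERMINISTIC_ALGO env var from the test's
# `deterministic` parameter before building the attention modules.
def apply_determinism_flag(deterministic: bool) -> None:
    os.environ["NVTE_ALLOW_NONDETERMINISTIC_ALGO"] = "0" if deterministic else "1"

apply_determinism_flag(True)
print(os.environ["NVTE_ALLOW_NONDETERMINISTIC_ALGO"])  # "0"
```

Because the modules read this env var at construction time, the flag has to be applied before the model is built, not merely passed to the backend-availability check.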

@greptile-apps greptile-apps bot left a comment

6 files reviewed, 1 comment

@greptile-apps
Contributor

greptile-apps bot commented Feb 18, 2026

Additional Comments (1)

tests/pytorch/attention/test_attention.py
deterministic not passed to execution functions

The new deterministic test parameter is used for backend availability filtering (lines 1854-1862, 1868-1873) but is never passed to _run_mha_fp8_vs_f16() (lines 1885-1896). The _run_mha_fp8_vs_f16 function doesn't accept a deterministic parameter, so DotProductAttention / MultiheadAttention modules inside will always use self.deterministic = False (from the default NVTE_ALLOW_NONDETERMINISTIC_ALGO=1).

The same issue applies to _run_dpa_fp8_vs_f16() calls in test_dpa_fp8_vs_f16 (lines 2142-2163).

As a result, when deterministic=True, the test only validates that the backend is correctly selected/rejected — it never actually exercises the deterministic code path through cuDNN. If the intent is to also verify deterministic execution, deterministic needs to be plumbed through to the run functions and set via NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 before module construction.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@greptile-apps greptile-apps bot left a comment

5 files reviewed, 1 comment

Comment on lines +1070 to 1079:

    if (
        fused_attention_backend == FusedAttnBackend["FP8"]
        and is_training
        and device_compute_capability < (10, 0)
    ):
        logger.debug(
            "Disabling FusedAttention for determinism reasons with FP8 on arch < sm100"
        )
        use_fused_attention = False
        fused_attention_backend = None

Missing cuDNN version guard - Python allows FP8 deterministic on SM100 without checking cuDNN version, but C++ only sets deterministic mode when cudnn_runtime_version >= 91900 (fused_attn_fp8.cu:2220). With cuDNN 9.14.0-9.18.x, execution will be non-deterministic despite deterministic=True.

Add version check like FP8 current scaling (line 504):

Suggested change:

    if (
        fused_attention_backend == FusedAttnBackend["FP8"]
        and is_training
        and (
            device_compute_capability < (10, 0)
            or cudnn_version < (9, 19, 0)
        )
    ):
        logger.debug(
            "Disabling FusedAttention for determinism reasons with FP8 on arch < sm100 "
            "or cuDNN < 9.19.0"
        )
        use_fused_attention = False
        fused_attention_backend = None

@sudhakarsingh27
Collaborator Author

/te-ci L0

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 1 comment

…ls.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27
Collaborator Author

/te-ci pytorch L0

@greptile-apps greptile-apps bot left a comment

5 files reviewed, 1 comment

@sudhakarsingh27
Collaborator Author

/te-ci jax L0

Actually switch off fused-attention backend

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@sudhakarsingh27
Collaborator Author

/te-ci L0

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27
Collaborator Author

/te-ci L0

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@sudhakarsingh27 sudhakarsingh27 merged commit e8f7c5a into NVIDIA:main Feb 24, 2026
36 of 42 checks passed
KshitijLakhani pushed a commit that referenced this pull request Feb 25, 2026
* add fp8 determinism support

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update cudnn fe to 1.18

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* enable determinism for sm90

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/pytorch/attention/dot_product_attention/utils.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Apply suggestion from @greptile-apps[bot]

Actually switch off fused-attention backend

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* remove extraneous `deterministic` test input arg

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Oleg-Goncharov pushed a commit to Oleg-Goncharov/TransformerEngine that referenced this pull request Feb 27, 2026
…IA#2621)

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>