Benchmark: Model benchmark - deterministic training support by Aishwarya-Tonpe · Pull Request #731 · microsoft/superbenchmark

Aishwarya-Tonpe · 2025-08-28T17:41:54Z

Adds an opt-in deterministic training mode to SuperBench's PyTorch model benchmarks. When enabled with --enable-determinism, PyTorch deterministic algorithms are enforced and per-step numerical fingerprints (loss, activation means) are recorded as metrics. These can be compared across runs using the existing sb result diagnosis pipeline to verify bit-exact reproducibility, which is useful for hardware validation and platform comparison.
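
A minimal sketch of what enforcing determinism in PyTorch typically involves, using the standard PyTorch switches. The helper name `enable_determinism` and the default seed are hypothetical, and the actual changes in pytorch_base.py may differ. The torch import is guarded so the snippet also runs where PyTorch is not installed.

```python
import os
import random


def enable_determinism(seed=42):
    """Seed the RNGs and request deterministic kernels (hypothetical helper)."""
    # cuBLAS needs a fixed workspace configuration for deterministic GEMMs;
    # it has to be set before the first CUDA call.
    os.environ.setdefault('CUBLAS_WORKSPACE_CONFIG', ':4096:8')
    random.seed(seed)
    try:
        import torch
    except ImportError:
        return False    # PyTorch not installed; only the Python RNG was seeded.
    torch.manual_seed(seed)    # seeds the CPU and CUDA generators
    torch.use_deterministic_algorithms(True)    # error out on non-deterministic ops
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False    # no autotuned kernel selection
    return True
```

Calling `enable_determinism(seed)` once, before the model and data are created, is the usual pattern; the return value only reports whether PyTorch was available.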

Flags added -

--enable-determinism
--check-frequency: Interval, in steps, at which the metrics are recorded
--deterministic-seed
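
As a rough illustration of how these flags could be parsed, here is a sketch using argparse. The flag names come from the description above; the types, the default values (42 and 100), and the choice to record at step 0 are assumptions, not confirmed by the PR.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # Flag names are from the PR description; types and defaults are assumed.
    parser.add_argument('--enable-determinism', action='store_true',
                        help='Enforce deterministic algorithms and record fingerprints.')
    parser.add_argument('--deterministic-seed', type=int, default=42,
                        help='Seed used when determinism is enabled (assumed default).')
    parser.add_argument('--check-frequency', type=int, default=100,
                        help='Record fingerprint metrics every N steps (assumed default).')
    return parser


def steps_to_record(num_steps, check_frequency):
    """Return the 0-based step indices at which fingerprints would be recorded."""
    return [step for step in range(num_steps) if step % check_frequency == 0]


args = build_parser().parse_args(['--enable-determinism', '--check-frequency', '50'])
print(args.enable_determinism, args.deterministic_seed, args.check_frequency)
print(steps_to_record(200, args.check_frequency))
```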

Changes -

Updated pytorch_base.py to handle deterministic settings and logging.
Added a new example script: pytorch_deterministic_example.py
Added a test file: test_pytorch_determinism_all.py to verify everything works as expected.

Usage -

Step 1: Run 1: run with --enable-determinism; the necessary metrics are recorded in the results-summary.jsonl file.
Step 2: Generate the baseline file from the Run 1 results using sb result generate-baseline.
Step 3: Run 2: run with --enable-determinism again, on a different machine (or the same machine); the metrics are recorded in that run's results-summary.jsonl file.
Step 4: Run diagnosis on the results of the two runs using the sb result diagnosis command.
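
The check that the steps above perform can be approximated by hand. The sketch below is not the sb result diagnosis implementation. It assumes single-node results, and it assumes the fingerprint metrics can be picked out by a substring in their name (`deterministic` here is a hypothetical marker; the real metric names may differ).

```python
import json


def load_metrics(path, marker='deterministic'):
    """Collect metrics whose name contains `marker` from a JSONL results file."""
    metrics = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue    # skip blank lines
            record = json.loads(line)
            for name, value in record.items():
                if marker in name:
                    metrics[name] = value
    return metrics


def diff_fingerprints(run1, run2):
    """Return names that are missing from one run or not exactly equal."""
    return sorted(
        name for name in run1.keys() | run2.keys()
        if name not in run1 or name not in run2 or run1[name] != run2[name]
    )
```

An empty list from `diff_fingerprints` means every recorded fingerprint matched exactly; the comparison deliberately uses `!=` rather than a tolerance, since the goal is bit-exact reproducibility.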

Note -

Make sure all parameters are identical between the two runs.
Running the diagnosis command requires the rules.yaml file.

tests/benchmarks/model_benchmarks/test_pytorch_determinism_all.py

Aishwarya-Tonpe · 2025-08-28T19:55:34Z

@Aishwarya-Tonpe please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.
@microsoft-github-policy-service agree [company="{your company}"]
Options:

(default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
(when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"
@microsoft-github-policy-service agree company="Microsoft"

codecov · 2025-08-29T17:23:00Z

Codecov Report

❌ Patch coverage is 83.65019% with 43 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.68%. Comparing base (575859b) to head (2b52174).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...rbench/benchmarks/model_benchmarks/pytorch_base.py | 84.54% | 17 Missing ⚠️ |
| superbench/common/model_log_utils.py | 76.74% | 10 Missing ⚠️ |
| superbench/analyzer/baseline_generation.py | 52.94% | 8 Missing ⚠️ |
| ...enchmarks/model_benchmarks/pytorch_mixtral_impl.py | 82.85% | 6 Missing ⚠️ |
| ...perbench/benchmarks/model_benchmarks/model_base.py | 66.66% | 2 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #731      +/-   ##
==========================================
- Coverage   85.70%   85.68%   -0.03%     
==========================================
  Files         102      103       +1     
  Lines        7703     7886     +183     
==========================================
+ Hits         6602     6757     +155     
- Misses       1101     1129      +28

| Flag | Coverage Δ | |
|---|---|---|
| cpu-python3.10-unit-test | `70.40% <41.60%> (-0.56%)` | ⬇️ |
| cpu-python3.12-unit-test | `70.40% <41.60%> (-0.56%)` | ⬇️ |
| cpu-python3.7-unit-test | `69.83% <39.92%> (-0.61%)` | ⬇️ |
| cuda-unit-test | `83.59% <82.44%> (-0.01%)` | ⬇️ |

examples/benchmarks/pytorch_deterministic_example.py

superbench/benchmarks/base.py

superbench/benchmarks/model_benchmarks/pytorch_base.py

superbench/benchmarks/model_benchmarks/pytorch_bert.py

superbench/benchmarks/model_benchmarks/pytorch_mixtral_impl.py

guoshzhao · 2025-10-09T17:55:15Z

Thanks for addressing all the comments. Since this is a big PR, could we do an apples-to-apples comparison before merging it? For example:

Run all e2e model benchmarks on the main branch.
Run all e2e model benchmarks on this branch with deterministic training disabled.
Run all e2e model benchmarks on this branch with deterministic training enabled.
Then compare whether the throughput metrics are as expected.
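
One way to frame such a comparison is sketched below. The function name, the dictionary layout, and the 5% tolerance are all hypothetical. It treats a gap between main and the branch with determinism disabled as a regression, and reports the deterministic run separately, since deterministic kernels can legitimately be slower.

```python
def throughput_report(main, disabled, enabled, tolerance=0.05):
    """Compare per-benchmark throughput across the three configurations.

    Each argument maps a benchmark name to a throughput value (higher is better).
    """
    report = {}
    for name in sorted(main.keys() & disabled.keys() & enabled.keys()):
        base = main[name]
        report[name] = {
            # Relative change versus the main branch.
            'disabled_vs_main': (disabled[name] - base) / base,
            'enabled_vs_main': (enabled[name] - base) / base,
            # With determinism off, the branch should match main within tolerance.
            'regression': abs(disabled[name] - base) / base > tolerance,
        }
    return report
```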

Aishwarya-Tonpe · 2025-10-13T19:18:54Z

> Thanks for addressing all the comments. Since this is a big PR, could we do an apples-to-apples comparison before merging it? For example:
>
> Run all e2e model benchmarks on the main branch.
> Run all e2e model benchmarks on this branch with deterministic training disabled.
> Run all e2e model benchmarks on this branch with deterministic training enabled.
> Then compare whether the throughput metrics are as expected.

Tested and compared all three configurations listed above. The results look good.
I can share the result files if needed, please let me know. Thank you!

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

superbench/benchmarks/model_benchmarks/pytorch_mixtral_impl.py

superbench/benchmarks/model_benchmarks/pytorch_lstm.py

superbench/benchmarks/model_benchmarks/pytorch_bert.py

superbench/benchmarks/model_benchmarks/pytorch_cnn.py

Copilot

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

superbench/benchmarks/model_benchmarks/pytorch_base.py

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.

superbench/benchmarks/model_benchmarks/pytorch_mixtral_impl.py

superbench/benchmarks/model_benchmarks/pytorch_base.py

examples/benchmarks/pytorch_deterministic_example.py

superbench/benchmarks/model_benchmarks/pytorch_base.py

tests/benchmarks/model_benchmarks/test_pytorch_determinism_all.py

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 7 comments.

superbench/benchmarks/model_benchmarks/pytorch_mixtral_impl.py

superbench/benchmarks/model_benchmarks/pytorch_base.py

examples/benchmarks/pytorch_deterministic_example.py

superbench/common/model_log_utils.py

superbench/benchmarks/model_benchmarks/pytorch_base.py

tests/benchmarks/model_benchmarks/test_pytorch_determinism_all.py

superbench/benchmarks/model_benchmarks/pytorch_base.py

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.

superbench/benchmarks/model_benchmarks/pytorch_mixtral_impl.py

superbench/common/model_log_utils.py

examples/benchmarks/pytorch_deterministic_example.py

superbench/analyzer/baseline_generation.py

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.

superbench/analyzer/data_diagnosis.py

superbench/analyzer/baseline_generation.py

tests/benchmarks/model_benchmarks/test_pytorch_determinism_all.py

examples/benchmarks/pytorch_deterministic_example.py

docs/user-tutorial/benchmarks/model-benchmarks.md

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

superbench/benchmarks/model_benchmarks/pytorch_base.py

tests/benchmarks/model_benchmarks/test_pytorch_determinism_all.py

Aishwarya-Tonpe requested a review from a team as a code owner August 28, 2025 17:41

github-advanced-security bot found potential problems Aug 28, 2025

tests/benchmarks/model_benchmarks/test_pytorch_determinism_all.py (fixed)

guoshzhao assigned polarG Sep 18, 2025

guoshzhao changed the title ~~Aishwaryatonpe/deterministic training~~ Benchmark: Model benchmark - deterministic training support Sep 18, 2025

guoshzhao reviewed Sep 19, 2025

examples/benchmarks/pytorch_deterministic_example.py (outdated, resolved)

guoshzhao requested a review from polarG September 24, 2025 23:25