
[PyTorch] [CI] Capture subprocess stderr in distributed tests for better CI error reporting #2802

Open
sudhakarsingh27 wants to merge 3 commits into NVIDIA:main from sudhakarsingh27:sudhakars/improve-distributed-test-errors

Conversation

@sudhakarsingh27
Collaborator

[PyTorch] [CI] Capture subprocess stderr in distributed tests for better CI error reporting

Distributed tests launch subprocesses via torch.distributed.launch/torchrun. When these fail, pytest only captures the CalledProcessError from the parent process, not the actual worker traceback. This makes CI JUnit XML reports show "exit code 1" with no useful error detail.
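This failure mode is easy to reproduce with plain subprocess (a standalone illustration, not one of the PR's tests):

```python
import subprocess
import sys

# Stand-in for a failing distributed worker: the child process raises.
failing_worker = [sys.executable, "-c", "raise RuntimeError('worker failed')"]

try:
    # check=True with no stderr capture: the traceback goes to the terminal,
    # not into the exception, so a JUnit report sees only the exit code.
    subprocess.run(failing_worker, check=True)
except subprocess.CalledProcessError as err:
    print(err)         # Command '...' returned non-zero exit status 1.
    print(err.stderr)  # None -- the worker traceback never reaches the report
```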

Add run_distributed() utility to tests/pytorch/utils.py that captures stderr while letting stdout stream to the terminal. On failure, the worker's stderr (containing the actual Python traceback) is included in the AssertionError, which pytest writes into the JUnit XML report.

Behavior:

  • Interactive use: stdout streams in real time (unchanged), stderr shown on failure
  • CI/JUnit XML: failure reports now include the actual worker traceback
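Based on that description, the helper plausibly looks like the following minimal sketch (not the PR's exact implementation; the 4000-character stderr tail and the `valid_returncodes` keyword are taken from the review discussion in this thread):

```python
import subprocess

def run_distributed(args, valid_returncodes=(0,), **kwargs):
    """Sketch of the helper described above (not the exact PR code).

    stdout is left untouched so it streams to the terminal in real time;
    stderr is piped so a failing worker's traceback can be re-raised where
    pytest will record it in the JUnit XML report.
    """
    result = subprocess.run(args, stderr=subprocess.PIPE, text=True, **kwargs)
    if result.returncode not in valid_returncodes:
        raise AssertionError(
            f"{args} exited with code {result.returncode}\n"
            f"--- captured stderr (tail) ---\n{(result.stderr or '')[-4000:]}"
        )
    return result
```

For inner pytest runs, `valid_returncodes=(0, 5)` treats pytest's exit code 5 (no tests collected / all tests skipped) as success.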

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Commits:

  • Capture subprocess stderr in distributed tests for better CI error reporting (full message in the description above).
    Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
  • Add --output-junit flag so ctest writes JUnit XML to /logs/, matching the pattern used by pytest tests. The XML is written before ctest exits, so it's captured even on test failure.
    Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 marked this pull request as ready for review March 27, 2026 07:51
@greptile-apps
Contributor

greptile-apps bot commented Mar 27, 2026

Greptile Summary

This PR improves CI error reporting for distributed PyTorch tests by adding a run_distributed() utility in tests/pytorch/utils.py that captures stderr via subprocess.PIPE while letting stdout stream to the terminal in real time. On failure, the captured stderr (containing worker tracebacks) is embedded in an AssertionError, which pytest then writes into JUnit XML reports. Several distributed test files are migrated from bare subprocess.run(check=True) to this utility. The qa/L0_cppunittest/test.sh script also gains --output-junit support for C++ tests.

Key changes:

  • New run_distributed() in tests/pytorch/utils.py with configurable valid_returncodes, a 4000-char stderr tail on failure, and full **kwargs passthrough for env, timeout, etc.
  • Migration of test_attention_with_cp.py, test_cast_master_weights_to_fp8.py, test_fusible_ops_with_userbuffers.py, and test_torch_fsdp2.py to run_distributed.
  • test_cast_master_weights_to_fp8 (the primary test function in that file) was not migrated, while test_nvfp4_partial_cast_matches_full in the same file was — this appears to be an oversight.
  • Three files retain an unused import subprocess after their migrations.

Confidence Score: 5/5

Safe to merge — all findings are P2 style/cleanup issues with no functional impact.

All remaining issues are non-blocking: one missed migration that still works correctly via check=True, and three dead imports. No correctness, security, or data-integrity concerns.

tests/pytorch/distributed/test_cast_master_weights_to_fp8.py — the test_cast_master_weights_to_fp8 function at line 1008 should be migrated for consistency.

Important Files Changed

  • tests/pytorch/utils.py: Adds the run_distributed() helper that captures stderr into the AssertionError on failure while leaving stdout to stream in real time; clean, well-documented implementation.
  • tests/pytorch/distributed/test_cast_master_weights_to_fp8.py: Partially migrated: test_nvfp4_partial_cast_matches_full converted, but test_cast_master_weights_to_fp8 at line 1008 still uses bare subprocess.run(check=True), missing the stderr-capture benefit.
  • tests/pytorch/attention/test_attention_with_cp.py: Both subprocess.run calls correctly migrated to run_distributed; leaves an unused import subprocess at line 6.
  • tests/pytorch/distributed/test_torch_fsdp2.py: Both subprocess.run calls migrated to run_distributed with correct valid_returncodes=(0, 5) to preserve prior semantics; an unused import subprocess remains.
  • tests/pytorch/distributed/test_fusible_ops_with_userbuffers.py: Single subprocess.run call migrated correctly; an unused import subprocess remains at line 13.
  • qa/L0_cppunittest/test.sh: Adds an XML_LOG_DIR variable and passes --output-junit to ctest for JUnit-compatible C++ test reports; straightforward and correct.
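The test.sh change could be sketched roughly as follows (paths, build directory, and output filename are assumptions; `ctest --output-junit` requires CMake/ctest >= 3.21). The command is echoed rather than executed so the sketch runs anywhere:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: route ctest results to JUnit XML under /logs/,
# matching the pattern used by the pytest-based tests.
XML_LOG_DIR=${XML_LOG_DIR:-/logs}
mkdir -p "${XML_LOG_DIR}" 2>/dev/null || true

# Build the ctest invocation; --output-junit writes the XML before ctest
# exits, so the report survives even when tests fail.
CTEST_CMD=(ctest --test-dir build --output-junit "${XML_LOG_DIR}/cpp_unittests.xml")
echo "${CTEST_CMD[@]}"
```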

Sequence Diagram

```mermaid
sequenceDiagram
    participant PT as pytest (parent process)
    participant RD as run_distributed()
    participant SP as subprocess (torchrun/torch.distributed.run)
    participant W as GPU worker(s)

    PT->>RD: run_distributed(args, valid_returncodes, **kwargs)
    RD->>SP: subprocess.run(args, stderr=PIPE, text=True, **kwargs)
    SP->>W: launch worker processes
    W-->>SP: stdout → terminal (streamed)
    W-->>SP: stderr → PIPE (buffered)
    SP-->>RD: CompletedProcess(returncode, stderr)
    alt returncode in valid_returncodes
        RD-->>PT: return CompletedProcess
    else failure
        RD->>RD: build AssertionError with stderr[-4000:]
        RD-->>PT: raise AssertionError (captured in JUnit XML)
    end
```


Quoted hunk from tests/pytorch/utils.py (tail of the run_distributed docstring):

```python
        Use (0, 5) for inner pytest runs where 5 means all tests skipped.
    **kwargs: Passed through to subprocess.run (e.g. env, timeout).
    """
    result = subprocess.run(args, stderr=subprocess.PIPE, text=True, **kwargs)
```

P2: **kwargs can silently conflict with stderr and text

If a caller ever passes stderr= or text= through **kwargs, Python will raise TypeError: subprocess.run() got multiple values for keyword argument 'stderr'. Consider explicitly popping or blocking those keys, or documenting the restriction:

```python
kwargs.pop("stderr", None)  # always captured internally
kwargs.pop("text", None)    # always text mode internally
result = subprocess.run(args, stderr=subprocess.PIPE, text=True, **kwargs)
```

None of the current call sites pass these, so this is not an immediate bug — just a fragile API surface.
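The conflict and the suggested guard can be demonstrated side by side (a standalone sketch; both function names are hypothetical, not the PR's API):

```python
import subprocess
import sys

def run_distributed_unguarded(args, **kwargs):
    # stderr/text are fixed internally; a duplicate key in **kwargs makes
    # this call raise TypeError ("got multiple values for keyword argument").
    return subprocess.run(args, stderr=subprocess.PIPE, text=True, **kwargs)

def run_distributed_guarded(args, **kwargs):
    # Drop the conflicting keys so callers cannot collide with the
    # internally fixed capture settings.
    kwargs.pop("stderr", None)
    kwargs.pop("text", None)
    return subprocess.run(args, stderr=subprocess.PIPE, text=True, **kwargs)

cmd = [sys.executable, "-c", "pass"]

try:
    run_distributed_unguarded(cmd, stderr=subprocess.DEVNULL)
except TypeError as exc:
    print(exc)  # got multiple values for keyword argument 'stderr'

run_distributed_guarded(cmd, stderr=subprocess.DEVNULL)  # caller's stderr is ignored
```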
