Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
123 changes: 123 additions & 0 deletions .claude/skills/analyze-test-report/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,123 @@
---
name: analyze-test-report
description: "Analyze torch-tensorrt local test results and drive failures to a fix. Use when the user pastes a test report / summary, asks why tests failed, asks to triage or fix failing tests, or mentions the JUnit/test-summary output from `just tests-report` / `just test-summary`. Covers where the JUnit XMLs live, how to read the consolidated report, how to reproduce a single failure, and how to categorize (real bug vs torch-API change vs OOM/skip vs flake)."
---

# Analyzing the torch-tensorrt test report

The local test tiers write **one JUnit XML per pytest suite**, and
`tests/py/utils/junit_summary.py` aggregates them into one report. The JUnit XMLs are
the source of truth — pytest exit codes can be masked when suites run in
sequence, so always reason from the XMLs / the report, not from "the run exited
non-zero".

## Where the output lives

JUnit XMLs are written to (first that is set):
- `$RUNNER_TEST_RESULTS_DIR` — set by CI.
- `$TMPDIR/trt_test_results` — locally. `$TMPDIR` defaults to
`/tmp/torch_tensorrt_$USER`, so the usual local path is:

```
/tmp/torch_tensorrt_<user>/trt_test_results/*.xml
```

Each file is named after its suite, e.g. `l1_dynamo_compile_tests_results.xml`,
`l0_dynamo_core_runtime_tests_results.xml`.

## Getting a report

- **Run a tier and get the agent report in one step** (best for an agent —
runs every suite past failures, then prints the paste-ready Markdown with node
ids, file, junit path, repro, message, traceback):
```sh
just tests-report l1 --agent # l0 | l1 | l2, optionally -ext
just tests-report l2-ext --agent # -ext also installs the model-test deps
```
Throttle the GPU with `just jobs=2 tests-report l2 --agent` if it OOMs.
- **Just re-render the last run's report** (no re-run):
```sh
just test-summary --agent # agent Markdown
just test-summary # color-coded terminal report
```
- Or run the script directly on any results dir:
```sh
python3 tests/py/utils/junit_summary.py /tmp/torch_tensorrt_<user>/trt_test_results --agent
```

If the user pasted a report, work from it directly. If you need more than it
shows (full traceback), open the `junit:` path it lists.

## Reading the agent report

Each failure block gives you everything to act:
- **`### N. [FAIL|ERROR] classname::name`** — exact pytest node identity.
- **`file:`** — the test source file.
- **`junit:`** — the JUnit XML; read its `<failure>` / `<error>` element for the
complete traceback (the report caps detail at 40 lines).
- **`repro:`** — a copy-paste command that re-runs the test.
- **`message:` / `detail:`** — the headline and (capped) traceback.

To pull the full traceback for one failure straight from the XML:
```sh
python3 - <<'PY'
import xml.etree.ElementTree as ET
r = ET.parse("<junit-path>").getroot()
for tc in r.iter("testcase"):
for tag in ("failure", "error"):
e = tc.find(tag)
if e is not None:
print(f"== {tc.get('classname')}::{tc.get('name')} ==")
print(e.get("message"), "\n", e.text)
PY
```

## Reproducing a failure

Use the `repro` line. Notes that matter on this repo:
- Run via `uv run --no-sync` — uses the already-built `.venv`, does **not**
rebuild torch-tensorrt. (Plain `uv run` would try to rebuild and fail.)
- `-n0` forces serial (one process). The default pytest config is `-n auto`,
which spawns a worker per core; on a single GPU that **OOMs** (CUDA out of
memory + segfaulting workers). For broader local runs use `just jobs=2 ...`.
- Set `TMPDIR=/tmp/torch_tensorrt_<user>` (or just use the `just` recipes, which
set it) so the TRT engine/timing cache is writable.

Re-run a single test, then the whole suite once it passes:
```sh
TMPDIR=/tmp/torch_tensorrt_$USER uv run --no-sync pytest <file> -k '<name>' -n0
just jobs=2 tests-l1-dynamo-compile # the suite the failure came from
```

## Categorizing failures (triage before fixing)

- **Real converter/lowering bug** — wrong output, cosine-sim below threshold,
shape/dtype error in `py/torch_tensorrt/...`. Fix the converter/lowering pass.
- **torch-API change** — `RuntimeError`/`AttributeError` from a torch op whose
signature/behavior changed in the nightly (the repo tracks torch nightlies).
Update the call site or the test to the new API; confirm the rule against the
installed torch before editing (`uv run --no-sync python -c "..."`).
- **OOM / segfault cascade** — `CUDA error: out of memory`, crashed workers.
Not a code bug: too many xdist workers for the GPU, or the GPU is occupied.
Re-run with `-n0` / `just jobs=2`; check `nvidia-smi`.
- **Skipped, not failed** — model tests skip without the `test-ext` deps
(`just install-test-ext`), and RTX/platform-gated tests skip by design.
Skips are healthy; don't "fix" them.
- **Flake** — passes on re-run with `-n0`. Only the narrow cudagraph stream-
capture transient is retried in CI (see `tests/py/utils/ci_helpers.sh`).

## Fix loop

1. Get/read the agent report; list the distinct failures and categorize each.
2. For each real failure: read the `junit` traceback, open the `file`, fix.
3. Re-run just that test with its `repro` (serial). Iterate.
4. Re-run the originating suite (`just jobs=2 tests-<tier>`), then
`just test-summary` to confirm the consolidated report is green.

## Related

- Tier definitions (what each suite runs): `tests/py/utils/ci_helpers.sh`
(`trt_tier_*`), shared with CI (`.github/workflows/_linux-x86_64-core.yml`).
- Local recipes: `justfile` (`tests-l0/l1/l2[...]`, `tests-report`,
`test-summary`, `install-test-ext`).
- Building / torch-nightly upgrades: the `build` skill.
23 changes: 23 additions & 0 deletions .github/scripts/filter-matrix.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,10 +16,30 @@
rtx_cuda_versions: List[str] = ["cu130", "cu132"]
trt_cuda_versions: List[str] = ["cu130", "cu132"]

# For PRs we build/test a single representative config to keep cycle time short.
# Full matrix runs on main / nightly / release branches.
PR_PYTHON_VERSION: str = "3.12"
PR_CUDA_VERSION: str = "cu130"

jetpack_container_image: str = "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
sbsa_container_image: str = "quay.io/pypa/manylinux_2_39_aarch64"


def pick_pr_representative(includes: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
"""Return a single-item list for PR builds to keep cycle time short.

Prefers PR_PYTHON_VERSION + PR_CUDA_VERSION; falls back to the first
available item so the matrix is never empty.
"""
preferred = [
item
for item in includes
if item.get("python_version") == PR_PYTHON_VERSION
and item.get("desired_cuda") == PR_CUDA_VERSION
]
return preferred[:1] or includes[:1]


def validate_matrix(matrix_dict: Dict[str, Any]) -> None:
"""Validate the structure of the input matrix."""
if not isinstance(matrix_dict, dict):
Expand Down Expand Up @@ -126,6 +146,9 @@ def main(args: list[str]) -> None:
):
filtered_includes.append(item)

if options.limit_pr_builds == "true" and options.jetpack != "true":
filtered_includes = pick_pr_representative(filtered_includes)

filtered_matrix_dict = {"include": filtered_includes}
print(json.dumps(filtered_matrix_dict))

Expand Down
Loading
Loading