feat: Allow for users / kv cache to add aliased I/O for inplace operations by narendasan · Pull Request #4251 · pytorch/TensorRT

narendasan · 2026-05-12T00:19:16Z

Description

Adds support for in-place ATen operators by extending the Torch-TensorRT compile pipeline and C++ runtime with aliased input/output bindings. The motivating case is streaming inference with a key/value cache (e.g. autoregressive decoders, ZoomASR): each step writes a single timestep into the cache, and without aliasing every step pays a full cache-size copy at the engine boundary. With aliased I/O the TensorRT engine writes directly into the user's (or module-held) cache storage; no fresh allocation, no post-engine copy.

Two aliased_io "kinds" are tracked so the runtime can reason about provenance:

kv_cache_update — TensorRT-enforced via IKVCacheUpdateLayer; reported through ICudaEngine::getAliasedInputTensor.
user — Torch-TensorRT-declared; reserved for future expansion if TRT exposes a public non-KV aliasing API.

What this PR does

Pipeline (Python)

New slice_scatter and index_copy converters that recognize KV-cache-update patterns (4-D static cache, dim=2, batch=1) and emit IKVCacheUpdateLayer with the output aliased to the cache input. Non-eligible cases fall back to scatter in TRT — no graph break.
For index_copy, two disjoint converters (validator-gated KV fast path at ConverterPriority.HIGH + scatter fallback at standard priority) cleanly split the cases.
aliased_io plumbed through TRTInterpreter → TRTInterpreterResult → SerializedInterpreterResult → TorchTensorRTModule. The output() step automatically promotes layer outputs that need to be network outputs (KVCacheUpdate requires it) and appends them after user outputs. The user/side-effect boundary is derived at runtime, not stored.
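The runtime-derived boundary mentioned in the last point can be sketched as follows. This is an illustration only, not the PR's implementation: it assumes that promoted side-effect outputs form a contiguous suffix of the output list (as "appends them after user outputs" suggests) and that `aliased_io` maps an output binding name to an `(input_name, kind)` pair. The binding names are hypothetical.

```python
from typing import Dict, List, Tuple


def user_output_count(
    output_binding_names: List[str], aliased_io: Dict[str, Tuple[str, str]]
) -> int:
    """Count the leading outputs that are returned to the caller.

    Assumption: side-effect outputs (those aliased to an input) sit at the
    end of the output list, so the boundary is found by walking backwards.
    """
    count = len(output_binding_names)
    while count > 0 and output_binding_names[count - 1] in aliased_io:
        count -= 1
    return count


# Hypothetical binding names, for illustration only.
outputs = ["attn_out", "k_cache_out", "v_cache_out"]
aliased_io = {
    "k_cache_out": ("k_cache", "kv_cache_update"),
    "v_cache_out": ("v_cache", "kv_cache_update"),
}
boundary = user_output_count(outputs, aliased_io)
user_outputs = outputs[:boundary]  # what forward() would hand back
```

Because the boundary is recomputed from the binding names and the alias map, nothing extra has to be serialized. A user-visible output that is itself aliased would need handling that this sketch ignores.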

Buffer-style support

lift_mutated_buffers lowering pass detects BUFFER_MUTATION patterns (the trailing aten.copy_(get_attr_buffer, ...) that ep.module() emits) and lifts each mutated buffer from get_attr to placeholder so the engine sees it as an input binding.
inline_lifted_buffers_into_gm post-compile transform registers the buffers as nn.Module state on the compiled GraphModule and rewrites the lifted placeholders to get_attr reads. The result is a plain fx.GraphModule (no custom wrapper class) that serializes cleanly through torch_tensorrt.save / torch.export.
Low-level entry point convert_exported_program_to_serialized_trt_engine gains lift_mutable_buffers: bool = False for power users who want to manage the resulting bindings themselves.
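The lifting step can be pictured with a toy graph model. The sketch below does not use torch.fx and is not the pass from this PR; `Node` and `lift_mutated_buffers_toy` are hypothetical names, and the op and target strings only mimic the FX vocabulary used above.

```python
from dataclasses import dataclass, replace
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Node:
    op: str  # "placeholder", "get_attr", "call_function", or "output"
    name: str
    target: str = ""
    args: Tuple[str, ...] = ()


def find_mutated_buffers(nodes: List[Node]) -> Set[str]:
    """Names of get_attr nodes that are the destination of an aten.copy_ call."""
    get_attrs = {n.name for n in nodes if n.op == "get_attr"}
    return {
        n.args[0]
        for n in nodes
        if n.op == "call_function"
        and n.target == "aten.copy_"
        and n.args
        and n.args[0] in get_attrs
    }


def lift_mutated_buffers_toy(nodes: List[Node]) -> List[Node]:
    """Rewrite each mutated get_attr as a placeholder after the existing inputs."""
    mutated = find_mutated_buffers(nodes)

    def is_lifted(n: Node) -> bool:
        return n.op == "get_attr" and n.name in mutated

    placeholders = [n for n in nodes if n.op == "placeholder"]
    lifted = [replace(n, op="placeholder", target="") for n in nodes if is_lifted(n)]
    others = [n for n in nodes if n.op != "placeholder" and not is_lifted(n)]
    return placeholders + lifted + others


graph = [
    Node("placeholder", "x"),
    Node("get_attr", "cache", target="cache"),
    Node("call_function", "updated", target="aten.index_copy", args=("cache", "x")),
    Node("call_function", "copy", target="aten.copy_", args=("cache", "updated")),
    Node("output", "output", args=("updated",)),
]
lifted_graph = lift_mutated_buffers_toy(graph)
```

After the rewrite, `cache` is an ordinary graph input, which is what lets the engine treat it as an input binding. The toy leaves the trailing `copy_` node in place; what the real pass does with it is not stated in this description.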

C++ runtime (ABI v9 → v10)

Bumped ABI_VERSION to "10"; added ALIASED_IO_IDX to SerializedInfoIndex.
serialize_aliased_io / deserialize_aliased_io helpers (wire format: output@input@kind records joined by BINDING_DELIM). Helpers live in runtime_utils.cpp alongside serialize_bindings.
TRTEngine constructor reconciles the build-time map against ICudaEngine::getAliasedInputTensor — the engine API is the source of truth for KV-style aliasing.
execute_engine records bound input tensors by binding name; for each output binding in aliased_io, binds the same data_ptr as the source input and skips fresh allocation. Pre-allocated outputs are disabled when aliased I/O is present.
CUDA Graphs integration: aliased input bindings bypass the persistent-clone path so the engine writes through to user storage; aliased outputs are skipped in the post-exec copy-back loop. Capture + replay both correctly mutate the user's tensor.
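As a rough Python model of the wire format and of how the runtime could use the resulting map, consider the sketch below. The real helpers are C++ and live in runtime_utils.cpp; the delimiter value, the delimiter guard, `plan_output_storage`, and the binding names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumption: "%" stands in for whatever BINDING_DELIM is in the C++ runtime.
BINDING_DELIM = "%"
FIELD_DELIM = "@"

AliasedIO = Dict[str, Tuple[str, str]]  # output name -> (input name, kind)


def serialize_aliased_io(aliased_io: AliasedIO) -> str:
    """Encode the map as output@input@kind records joined by BINDING_DELIM."""
    records = []
    for output, (source, kind) in aliased_io.items():
        for field in (output, source, kind):
            if FIELD_DELIM in field or BINDING_DELIM in field:
                raise ValueError(f"delimiter found in field: {field!r}")
        records.append(FIELD_DELIM.join((output, source, kind)))
    return BINDING_DELIM.join(records)


def deserialize_aliased_io(blob: str) -> AliasedIO:
    """Inverse of serialize_aliased_io; an empty string means no aliasing."""
    aliased_io: AliasedIO = {}
    if not blob:
        return aliased_io
    for record in blob.split(BINDING_DELIM):
        output, source, kind = record.split(FIELD_DELIM)
        aliased_io[output] = (source, kind)
    return aliased_io


def plan_output_storage(output_names: List[str], aliased_io: AliasedIO) -> Dict[str, str]:
    """For each output, name the input whose storage it reuses, or "<fresh>"."""
    return {
        name: aliased_io[name][0] if name in aliased_io else "<fresh>"
        for name in output_names
    }


# Hypothetical binding names.
aliased = {"k_cache_out": ("k_cache", "kv_cache_update")}
blob = serialize_aliased_io(aliased)
plan = plan_output_storage(["attn_out", "k_cache_out"], deserialize_aliased_io(blob))
```

In this model, only outputs marked `<fresh>` would get a new allocation; an aliased output is bound to the storage of the named input, which is the behavior described for execute_engine above.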

Docs + examples

docsrc/contributors/inplace_operations.rst — full design doc covering motivation, primitives, pipeline, runtime, serialization format, and known limitations.
Three examples under examples/dynamo/:
- aliased_io_user_inputs.py — caller-owned cache (simplest case)
- aliased_io_buffers.py — module-owned cache via register_buffer
- aliased_io_kv_attention.py — realistic single-layer transformer attention block with static KV cache

Partially fixes #4240 (in-place custom plugins / multiple outputs): this PR addresses the in-place-operator side; plugin-side aliased I/O is explicitly out of scope here.

Type of change

New feature (non-breaking for callers who don't opt in to aliased I/O; ABI-breaking for existing engine binaries — older serialized engines fail the version check and need to be rebuilt, consistent with prior ABI bumps).
This change requires a documentation update (included).

Checklist

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas and hacks
I have made corresponding changes to the documentation
I have added tests to verify my fix or my feature
New and existing unit tests pass locally with my changes
I have added the relevant labels to my PR

Test summary

40 new tests across 8 files (the sum of the per-file counts below), all passing:

| File | Cases | Covers |
| --- | --- | --- |
| `tests/py/dynamo/conversion/test_slice_scatter_aten.py` | 8 | scatter-fallback path (numerical correctness via the standard converter harness) |
| `tests/py/dynamo/runtime/test_aliased_io.py` | 8 | end-to-end aliased I/O (user-input single/paired/streaming + buffer-style + regressions) |
| `tests/py/dynamo/runtime/test_index_copy_kv.py` | 4 | KV fast path + 3 fallback shapes for `aten.index_copy` |
| `tests/py/dynamo/runtime/test_lift_mutable_buffers_api.py` | 4 | low-level `lift_mutable_buffers=True` flag round-trip (introspect engine, construct module, run) |
| `tests/py/dynamo/runtime/test_aliased_io_serialization.py` | 3 | `torch_tensorrt.save` / `load` round-trip for user-input + buffer-backed + streaming buffer |
| `tests/py/dynamo/runtime/test_aliased_io_cudagraphs.py` | 3 | CUDA Graph capture + replay for both user-input and buffer-backed KV; cudagraphs vs non-cudagraphs parity |
| `tests/py/dynamo/runtime/test_hf_static_cache_xfail.py` | 1 | xfail documenting the current HF + StaticCache gap (asserts known failure mode) |
| `tests/py/dynamo/lowering/test_buffer_lifting.py` | 9 | `lift_mutated_buffers` + `inline_lifted_buffers_into_gm` unit tests |

Known gaps (documented)

Stock HuggingFace decoder LMs with StaticCache don't compile end-to-end yet: torch.export's run_decompositions raises internally on the EP that convert_and_export_with_cache produces. The xfail test asserts the known failure so a future upstream fix surfaces as a test failure. Path forward documented in the design doc.
IKVCacheUpdateLayer requires static s_max. Dynamic-sequence-length cache shapes fall through to the scatter path (still correct, no aliasing).
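The fast-path conditions named in this description (4-D static cache, dim=2, batch=1, cache supplied directly as a network input) can be collected into a single predicate. The sketch below is hypothetical and is not the converter's validator: the function name, the use of strings to stand for symbolic extents, and the acceptance of a negative `dim` are all assumptions.

```python
from typing import Sequence, Union

Dim = Union[int, str]  # int for a static extent, str for a symbolic one


def is_kv_fast_path_eligible(
    cache_shape: Sequence[Dim], dim: int, cache_is_network_input: bool
) -> bool:
    """Return True when a cache update meets the conditions listed in the PR."""
    if not cache_is_network_input:
        return False  # e.g. cache + 0 is no longer a placeholder
    if len(cache_shape) != 4:
        return False
    if not all(isinstance(d, int) and not isinstance(d, bool) for d in cache_shape):
        return False  # a dynamic extent (such as s_max) rules out the fast path
    if not -4 <= dim < 4:
        return False
    # Assumption: dim=-2 is treated the same as dim=2 for a 4-D tensor.
    return cache_shape[0] == 1 and dim % 4 == 2


static_ok = is_kv_fast_path_eligible((1, 8, 128, 64), 2, True)
dynamic_seq = is_kv_fast_path_eligible((1, 8, "s_max", 64), 2, True)
```

Any case for which the predicate returns False corresponds to the scatter fallback, which stays correct but does not alias.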

github-actions

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/conversion/test_slice_scatter_aten.py	2026-05-12 00:26:56.728308+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/conversion/test_slice_scatter_aten.py	2026-05-12 00:27:18.993037+00:00
@@ -15,10 +15,11 @@
This file covers the fallback path. To force the fallback regardless of
shape we add a small no-op (``+ 0``) to the cache so it isn't a direct
network input — the converter's "input is a placeholder" check fails and
falls through to scatter.
"""
+
import torch
from parameterized import parameterized
from torch.testing._internal.common_utils import run_tests

from .harness import DispatchTestCase
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/lowering/test_buffer_lifting.py	2026-05-12 00:26:56.731194+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/lowering/test_buffer_lifting.py	2026-05-12 00:27:19.634834+00:00
@@ -21,10 +21,11 @@
* ``inline_lifted_buffers_into_gm`` rewrites the lifted-buffer
  placeholders into ``get_attr`` reads and registers the buffers as
  module state. The result is a plain ``fx.GraphModule`` that
  serializes via ``torch_tensorrt.save`` without an external wrapper.
"""
+
import inspect

import torch
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_cudagraphs.py	2026-05-12 00:26:56.733500+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_cudagraphs.py	2026-05-12 00:27:21.127481+00:00
@@ -17,10 +17,11 @@
  is already visible on the user's input).

These tests cover capture + replay correctness for both KV-cache patterns
(user-input and buffer-style).
"""
+
import unittest

import torch
import torch_tensorrt
from torch.export import export
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_serialization.py	2026-05-12 00:26:56.733500+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_serialization.py	2026-05-12 00:27:21.138494+00:00
@@ -15,10 +15,11 @@
  ``torch.export``. The ``inline_lifted_buffers_into_gm`` post-compile
  transform replaces what used to be an external ``BufferThreadingModule``
  wrapper — making the result a plain ``fx.GraphModule`` that exports
  naturally without a custom wrapper class.
"""
+
import tempfile

import torch
import torch_tensorrt
from torch.export import export
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_hf_static_cache_xfail.py	2026-05-12 00:26:56.733500+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_hf_static_cache_xfail.py	2026-05-12 00:27:21.208126+00:00
@@ -36,10 +36,11 @@
  workaround that skips ``run_decompositions`` for already-decomposed EPs.

When the upstream issues are resolved or those features land, this
xfail test should start passing — flip it to a real test then.
"""
+
import unittest

import torch
import torch_tensorrt

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io.py	2026-05-12 00:26:56.733500+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io.py	2026-05-12 00:27:21.275543+00:00
@@ -19,10 +19,11 @@
* ``TorchTensorRTModule.forward`` filters aliased outputs from the user
  return tuple.
* For buffer-style models, ``lift_mutated_buffers`` rewrites the EP and
  ``BufferThreadingModule`` threads buffers through each call.
"""
+
import torch
import torch_tensorrt
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_index_copy_kv.py	2026-05-12 00:26:56.733500+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_index_copy_kv.py	2026-05-12 00:27:21.304553+00:00
@@ -13,10 +13,11 @@

These tests verify both paths end-to-end via the C++ runtime: the
fast path mutates in place, the fallback produces correct numerical
results without aliasing.
"""
+
import torch
import torch_tensorrt
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_lift_mutable_buffers_api.py	2026-05-12 00:26:56.733500+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_lift_mutable_buffers_api.py	2026-05-12 00:27:21.380261+00:00
@@ -14,10 +14,11 @@
3. Construct a ``TorchTensorRTModule`` (C++ runtime — required for
   aliased I/O) with the discovered bindings.
4. Thread the buffer values in on each call and verify in-place
   mutation works (cache state persists across calls).
"""
+
import torch
import torch_tensorrt
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests
from torch_tensorrt.dynamo import convert_exported_program_to_serialized_trt_engine

github-actions

There are some changes that do not conform to C++ style guidelines:

diff --git a/home/runner/work/TensorRT/TensorRT/core/runtime/execute_engine.cpp b/tmp/changes.txt
index a46ad8f..45dbf63 100644
--- a/home/runner/work/TensorRT/TensorRT/core/runtime/execute_engine.cpp
+++ b/tmp/changes.txt
@@ -335,8 +335,7 @@ std::vector<at::Tensor> execute_engine(std::vector<at::Tensor> inputs, c10::intr
            std::make_unique<torch::autograd::profiler::RecordProfile>(compiled_engine->input_profile_path);
      }

-      setup_input_tensors(
-          inputs, compiled_engine, cudagraphs_enabled, need_cudagraphs_record, bound_inputs_by_name);
+      setup_input_tensors(inputs, compiled_engine, cudagraphs_enabled, need_cudagraphs_record, bound_inputs_by_name);
      // Check if input shapes can be inferred.
      int32_t const io_size{compiled_engine->cuda_engine->getNbIOTensors()};
      std::vector<char const*> names(io_size);
@@ -494,7 +493,6 @@ std::vector<at::Tensor> execute_engine(std::vector<at::Tensor> inputs, c10::intr
    // (validated at engine construction). The bound-inputs map is unused here.
    std::unordered_map<std::string, at::Tensor> bound_inputs_by_name;

-
    { // Input Setup
      std::unique_ptr<torch::autograd::profiler::RecordProfile> input_profiler_guard;
      if (compiled_engine->profile_execution) {
ERROR: Some files do not conform to style guidelines

github-actions

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_TorchTensorRTModule.py	2026-06-10 20:10:37.595181+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_TorchTensorRTModule.py	2026-06-10 20:11:00.737050+00:00
@@ -41,10 +41,11 @@
    Optional[SerializedTensorRTEngineFmt],
    List[str],
    List[str],
]

+
def user_output_count(
    output_binding_names: List[str], aliased_io: Dict[str, Tuple[str, str]]
) -> int:
    """Derive the boundary between user-visible outputs and side-effect
    aliased outputs.
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/conversion/test_slice_scatter_aten.py	2026-06-10 20:10:37.618585+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/conversion/test_slice_scatter_aten.py	2026-06-10 20:11:02.717030+00:00
@@ -15,10 +15,11 @@
This file covers the fallback path. To force the fallback regardless of
shape we add a small no-op (``+ 0``) to the cache so it isn't a direct
network input — the converter's "input is a placeholder" check fails and
falls through to scatter.
"""
+
import torch
from parameterized import parameterized
from torch.testing._internal.common_utils import run_tests

from .harness import DispatchTestCase
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/lowering/test_buffer_lifting.py	2026-06-10 20:10:37.622167+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/lowering/test_buffer_lifting.py	2026-06-10 20:11:03.643555+00:00
@@ -21,10 +21,11 @@
* ``inline_lifted_buffers_into_gm`` rewrites the lifted-buffer
  placeholders into ``get_attr`` reads and registers the buffers as
  module state. The result is a plain ``fx.GraphModule`` that
  serializes via ``torch_tensorrt.save`` without an external wrapper.
"""
+
import inspect

import torch
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_serialization.py	2026-06-10 20:10:37.624595+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_serialization.py	2026-06-10 20:11:05.387147+00:00
@@ -15,10 +15,11 @@
  ``torch.export``. The ``inline_lifted_buffers_into_gm`` post-compile
  transform replaces what used to be an external ``BufferThreadingModule``
  wrapper — making the result a plain ``fx.GraphModule`` that exports
  naturally without a custom wrapper class.
"""
+
import tempfile

import torch
import torch_tensorrt
from torch.export import export
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_index_copy_kv.py	2026-06-10 20:10:37.624595+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_index_copy_kv.py	2026-06-10 20:11:05.513263+00:00
@@ -13,10 +13,11 @@

These tests verify both paths end-to-end via the C++ runtime: the
fast path mutates in place, the fallback produces correct numerical
results without aliasing.
"""
+
import torch
import torch_tensorrt
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io.py	2026-06-10 20:10:37.624595+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io.py	2026-06-10 20:11:05.532849+00:00
@@ -19,10 +19,11 @@
* ``TorchTensorRTModule.forward`` filters aliased outputs from the user
  return tuple.
* For buffer-style models, ``lift_mutated_buffers`` rewrites the EP and
  ``BufferThreadingModule`` threads buffers through each call.
"""
+
import torch
import torch_tensorrt
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_cudagraphs.py	2026-06-10 20:10:37.624595+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_cudagraphs.py	2026-06-10 20:11:05.670076+00:00
@@ -17,10 +17,11 @@
  is already visible on the user's input).

These tests cover capture + replay correctness for both KV-cache patterns
(user-input and buffer-style).
"""
+
import unittest

import torch
import torch_tensorrt
from torch.export import export
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_lift_mutable_buffers_api.py	2026-06-10 20:10:37.624595+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_lift_mutable_buffers_api.py	2026-06-10 20:11:05.709848+00:00
@@ -14,10 +14,11 @@
3. Construct a ``TorchTensorRTModule`` (C++ runtime — required for
   aliased I/O) with the discovered bindings.
4. Thread the buffer values in on each call and verify in-place
   mutation works (cache state persists across calls).
"""
+
import torch
import torch_tensorrt
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests
from torch_tensorrt.dynamo import convert_exported_program_to_serialized_trt_engine
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_hf_static_cache_xfail.py	2026-06-10 20:10:37.624595+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_hf_static_cache_xfail.py	2026-06-10 20:11:05.721107+00:00
@@ -36,10 +36,11 @@
  workaround that skips ``run_decompositions`` for already-decomposed EPs.

When the upstream issues are resolved or those features land, this
xfail test should start passing — flip it to a real test then.
"""
+
import unittest

import torch
import torch_tensorrt

cehongwang · 2026-06-16T23:56:38Z

Anything we could do to avoid manually resetting the KV cache before every run?

github-actions

There are some changes that do not conform to C++ style guidelines:

diff --git a/home/runner/work/TensorRT/TensorRT/core/runtime/execute_engine.cpp b/tmp/changes.txt
index 18006f2..75a8b6d 100644
--- a/home/runner/work/TensorRT/TensorRT/core/runtime/execute_engine.cpp
+++ b/tmp/changes.txt
@@ -354,8 +354,7 @@ std::vector<at::Tensor> execute_engine(std::vector<at::Tensor> inputs, c10::intr
            std::make_unique<torch::autograd::profiler::RecordProfile>(compiled_engine->input_profile_path);
      }

-      setup_input_tensors(
-          inputs, compiled_engine, effective_cudagraphs, need_cudagraphs_record, bound_inputs_by_name);
+      setup_input_tensors(inputs, compiled_engine, effective_cudagraphs, need_cudagraphs_record, bound_inputs_by_name);
      // Check if input shapes can be inferred.
      int32_t const io_size{compiled_engine->cuda_engine->getNbIOTensors()};
      std::vector<char const*> names(io_size);
@@ -515,7 +514,6 @@ std::vector<at::Tensor> execute_engine(std::vector<at::Tensor> inputs, c10::intr
    // (validated at engine construction). The bound-inputs map is unused here.
    std::unordered_map<std::string, at::Tensor> bound_inputs_by_name;

-
    { // Input Setup
      std::unique_ptr<torch::autograd::profiler::RecordProfile> input_profiler_guard;
      if (compiled_engine->profile_execution) {
ERROR: Some files do not conform to style guidelines

github-actions

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/lowering/passes/decompose_dynamic_slice_scatter.py	2026-06-17 22:23:27.911413+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/lowering/passes/decompose_dynamic_slice_scatter.py	2026-06-17 22:23:53.349374+00:00
@@ -38,13 +38,11 @@
        dim = args[2]
        start: Optional[Any] = args[3] if len(args) > 3 else None
        end: Optional[Any] = args[4] if len(args) > 4 else None
        step: Optional[Any] = args[5] if len(args) > 5 else None

-        is_dynamic = any(
-            isinstance(x, torch.fx.Node) for x in (start, end, step)
-        )
+        is_dynamic = any(isinstance(x, torch.fx.Node) for x in (start, end, step))
        if not is_dynamic:
            continue

        input_val = input_node.meta.get("val")
        if input_val is None:
@@ -82,13 +80,11 @@
                torch.ops.aten.view.default,
                (arange_node, view_shape),
            )

            expand_size = [
-                gm.graph.call_function(
-                    torch.ops.aten.sym_size.int, (src_node, i)
-                )
+                gm.graph.call_function(torch.ops.aten.sym_size.int, (src_node, i))
                for i in range(rank)
            ]
            expand_node = gm.graph.call_function(
                torch.ops.aten.expand.default,
                (view_node, expand_size),
@@ -108,10 +104,8 @@
            dim,
        )

    if changed:
        gm = clean_up_graph_after_modifications(gm)
-        logger.debug(
-            "After decompose_dynamic_slice_scatter:\n%s", gm.graph
-        )
+        logger.debug("After decompose_dynamic_slice_scatter:\n%s", gm.graph)

    return gm
--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_TorchTensorRTModule.py	2026-06-17 22:23:27.913545+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_TorchTensorRTModule.py	2026-06-17 22:23:54.506155+00:00
@@ -42,10 +42,11 @@
    Optional[SerializedTensorRTEngineFmt],
    List[str],
    List[str],
]

+
def user_output_count(
    output_binding_names: List[str], aliased_io: Dict[str, Tuple[str, str]]
) -> int:
    """Derive the boundary between user-visible outputs and side-effect
    aliased outputs.
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/conversion/test_slice_scatter_aten.py	2026-06-17 22:23:27.937613+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/conversion/test_slice_scatter_aten.py	2026-06-17 22:23:56.484974+00:00
@@ -15,10 +15,11 @@
This file covers the fallback path. To force the fallback regardless of
shape we add a small no-op (``+ 0``) to the cache so it isn't a direct
network input — the converter's "input is a placeholder" check fails and
falls through to scatter.
"""
+
import torch
from parameterized import parameterized
from torch.testing._internal.common_utils import run_tests

from .harness import DispatchTestCase
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/lowering/test_buffer_lifting.py	2026-06-17 22:23:27.941414+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/lowering/test_buffer_lifting.py	2026-06-17 22:23:57.432798+00:00
@@ -21,10 +21,11 @@
* ``inline_lifted_buffers_into_gm`` rewrites the lifted-buffer
  placeholders into ``get_attr`` reads and registers the buffers as
  module state. The result is a plain ``fx.GraphModule`` that
  serializes via ``torch_tensorrt.save`` without an external wrapper.
"""
+
import inspect

import torch
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/lowering/test_decompose_dynamic_slice_scatter.py	2026-06-17 22:23:27.941414+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/lowering/test_decompose_dynamic_slice_scatter.py	2026-06-17 22:23:57.444168+00:00
@@ -4,10 +4,11 @@
When ``slice_scatter``'s start/end/step is a SymInt (e.g. derived from a
dynamic dim), the static converter path doesn't apply. The lowering pass
rewrites the op into ``arange + view + expand + scatter`` so each piece
hits its existing dynamic-shape converter.
"""
+
import unittest

import torch
import torch_tensorrt
from torch.export import Dim, export
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_cudagraphs.py	2026-06-17 22:23:27.943995+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_cudagraphs.py	2026-06-17 22:23:59.314499+00:00
@@ -17,10 +17,11 @@
  is already visible on the user's input).

These tests cover capture + replay correctness for both KV-cache patterns
(user-input and buffer-style).
"""
+
import unittest

import torch
import torch_tensorrt
from torch.export import export
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io.py	2026-06-17 22:23:27.943995+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io.py	2026-06-17 22:23:59.423966+00:00
@@ -19,10 +19,11 @@
* ``TorchTensorRTModule.forward`` filters aliased outputs from the user
  return tuple.
* For buffer-style models, ``lift_mutated_buffers`` rewrites the EP and
  ``BufferThreadingModule`` threads buffers through each call.
"""
+
import torch
import torch_tensorrt
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_serialization.py	2026-06-17 22:23:27.943995+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_serialization.py	2026-06-17 22:23:59.464305+00:00
@@ -15,10 +15,11 @@
  ``torch.export``. The ``inline_lifted_buffers_into_gm`` post-compile
  transform replaces what used to be an external ``BufferThreadingModule``
  wrapper — making the result a plain ``fx.GraphModule`` that exports
  naturally without a custom wrapper class.
"""
+
import tempfile

import torch
import torch_tensorrt
from torch.export import export
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_hf_static_cache_xfail.py	2026-06-17 22:23:27.943995+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_hf_static_cache_xfail.py	2026-06-17 22:23:59.510863+00:00
@@ -36,10 +36,11 @@
  workaround that skips ``run_decompositions`` for already-decomposed EPs.

When the upstream issues are resolved or those features land, this
xfail test should start passing — flip it to a real test then.
"""
+
import unittest

import torch
import torch_tensorrt

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_index_copy_kv.py	2026-06-17 22:23:27.943995+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_index_copy_kv.py	2026-06-17 22:23:59.592871+00:00
@@ -13,10 +13,11 @@

These tests verify both paths end-to-end via the C++ runtime: the
fast path mutates in place, the fallback produces correct numerical
results without aliasing.
"""
+
import torch
import torch_tensorrt
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_lift_mutable_buffers_api.py	2026-06-17 22:23:27.943995+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_lift_mutable_buffers_api.py	2026-06-17 22:23:59.971603+00:00
@@ -14,10 +14,11 @@
3. Construct a ``TorchTensorRTModule`` (C++ runtime — required for
   aliased I/O) with the discovered bindings.
4. Thread the buffer values in on each call and verify in-place
   mutation works (cache state persists across calls).
"""
+
import torch
import torch_tensorrt
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests
from torch_tensorrt.dynamo import convert_exported_program_to_serialized_trt_engine

github-actions

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/lowering/passes/decompose_dynamic_slice_scatter.py	2026-06-22 17:24:51.087383+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/lowering/passes/decompose_dynamic_slice_scatter.py	2026-06-22 17:25:13.547284+00:00
@@ -38,13 +38,11 @@
        dim = args[2]
        start: Optional[Any] = args[3] if len(args) > 3 else None
        end: Optional[Any] = args[4] if len(args) > 4 else None
        step: Optional[Any] = args[5] if len(args) > 5 else None

-        is_dynamic = any(
-            isinstance(x, torch.fx.Node) for x in (start, end, step)
-        )
+        is_dynamic = any(isinstance(x, torch.fx.Node) for x in (start, end, step))
        if not is_dynamic:
            continue

        input_val = input_node.meta.get("val")
        if input_val is None:
@@ -82,13 +80,11 @@
                torch.ops.aten.view.default,
                (arange_node, view_shape),
            )

            expand_size = [
-                gm.graph.call_function(
-                    torch.ops.aten.sym_size.int, (src_node, i)
-                )
+                gm.graph.call_function(torch.ops.aten.sym_size.int, (src_node, i))
                for i in range(rank)
            ]
            expand_node = gm.graph.call_function(
                torch.ops.aten.expand.default,
                (view_node, expand_size),
@@ -108,10 +104,8 @@
            dim,
        )

    if changed:
        gm = clean_up_graph_after_modifications(gm)
-        logger.debug(
-            "After decompose_dynamic_slice_scatter:\n%s", gm.graph
-        )
+        logger.debug("After decompose_dynamic_slice_scatter:\n%s", gm.graph)

    return gm
--- /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_TorchTensorRTModule.py	2026-06-22 17:24:51.089544+00:00
+++ /home/runner/work/TensorRT/TensorRT/py/torch_tensorrt/dynamo/runtime/_TorchTensorRTModule.py	2026-06-22 17:25:14.767148+00:00
@@ -42,10 +42,11 @@
    Optional[SerializedTensorRTEngineFmt],
    List[str],
    List[str],
]

+
def user_output_count(
    output_binding_names: List[str], aliased_io: Dict[str, Tuple[str, str]]
) -> int:
    """Derive the boundary between user-visible outputs and side-effect
    aliased outputs.
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/conversion/test_slice_scatter_aten.py	2026-06-22 17:24:51.114004+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/conversion/test_slice_scatter_aten.py	2026-06-22 17:25:16.751307+00:00
@@ -15,10 +15,11 @@
This file covers the fallback path. To force the fallback regardless of
shape we add a small no-op (``+ 0``) to the cache so it isn't a direct
network input — the converter's "input is a placeholder" check fails and
falls through to scatter.
"""
+
import torch
from parameterized import parameterized
from torch.testing._internal.common_utils import run_tests

from .harness import DispatchTestCase
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/lowering/test_buffer_lifting.py	2026-06-22 17:24:51.117612+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/lowering/test_buffer_lifting.py	2026-06-22 17:25:17.796066+00:00
@@ -21,10 +21,11 @@
* ``inline_lifted_buffers_into_gm`` rewrites the lifted-buffer
  placeholders into ``get_attr`` reads and registers the buffers as
  module state. The result is a plain ``fx.GraphModule`` that
  serializes via ``torch_tensorrt.save`` without an external wrapper.
"""
+
import inspect

import torch
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/lowering/test_decompose_dynamic_slice_scatter.py	2026-06-22 17:24:51.117612+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/lowering/test_decompose_dynamic_slice_scatter.py	2026-06-22 17:25:17.804039+00:00
@@ -4,10 +4,11 @@
When ``slice_scatter``'s start/end/step is a SymInt (e.g. derived from a
dynamic dim), the static converter path doesn't apply. The lowering pass
rewrites the op into ``arange + view + expand + scatter`` so each piece
hits its existing dynamic-shape converter.
"""
+
import unittest

import torch
import torch_tensorrt
from torch.export import Dim, export
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_cudagraphs.py	2026-06-22 17:24:51.120199+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_cudagraphs.py	2026-06-22 17:25:19.653997+00:00
@@ -17,10 +17,11 @@
  is already visible on the user's input).

These tests cover capture + replay correctness for both KV-cache patterns
(user-input and buffer-style).
"""
+
import unittest

import torch
import torch_tensorrt
from torch.export import export
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_serialization.py	2026-06-22 17:24:51.120199+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io_serialization.py	2026-06-22 17:25:19.725567+00:00
@@ -15,10 +15,11 @@
  ``torch.export``. The ``inline_lifted_buffers_into_gm`` post-compile
  transform replaces what used to be an external ``BufferThreadingModule``
  wrapper — making the result a plain ``fx.GraphModule`` that exports
  naturally without a custom wrapper class.
"""
+
import tempfile

import torch
import torch_tensorrt
from torch.export import export
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_hf_static_cache_xfail.py	2026-06-22 17:24:51.120199+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_hf_static_cache_xfail.py	2026-06-22 17:25:19.810856+00:00
@@ -36,10 +36,11 @@
  workaround that skips ``run_decompositions`` for already-decomposed EPs.

When the upstream issues are resolved or those features land, this
xfail test should start passing — flip it to a real test then.
"""
+
import unittest

import torch
import torch_tensorrt

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io.py	2026-06-22 17:24:51.120199+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_aliased_io.py	2026-06-22 17:25:19.813106+00:00
@@ -19,10 +19,11 @@
* ``TorchTensorRTModule.forward`` filters aliased outputs from the user
  return tuple.
* For buffer-style models, ``lift_mutated_buffers`` rewrites the EP and
  ``BufferThreadingModule`` threads buffers through each call.
"""
+
import torch
import torch_tensorrt
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_index_copy_kv.py	2026-06-22 17:24:51.120199+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_index_copy_kv.py	2026-06-22 17:25:19.911947+00:00
@@ -13,10 +13,11 @@

These tests verify both paths end-to-end via the C++ runtime: the
fast path mutates in place, the fallback produces correct numerical
results without aliasing.
"""
+
import torch
import torch_tensorrt
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests

--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_lift_mutable_buffers_api.py	2026-06-22 17:24:51.120199+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/runtime/test_lift_mutable_buffers_api.py	2026-06-22 17:25:20.027422+00:00
@@ -14,10 +14,11 @@
3. Construct a ``TorchTensorRTModule`` (C++ runtime — required for
   aliased I/O) with the discovered bindings.
4. Thread the buffer values in on each call and verify in-place
   mutation works (cache state persists across calls).
"""
+
import torch
import torch_tensorrt
from torch.export import export
from torch.testing._internal.common_utils import TestCase, run_tests
from torch_tensorrt.dynamo import convert_exported_program_to_serialized_trt_engine

cehongwang · 2026-06-30T07:21:17Z

+      // the engine writes through to the user's storage. The user is already
+      // required to pass stable input addresses under cudagraphs, so the
+      // aliasing contract is compatible.
+      bool is_aliased_input = false;


I think we should have this info at compile time rather than recomputing it in the execution phase on every iteration.

It's per input; maybe we can include it in the binding data structure you added.

cehongwang · 2026-06-30T07:37:05Z

    }
    // Pre-allocated output can be used when previous and current state are true without shape change
    if (old_pre_allocated_outputs && new_pre_allocated_output && !shape_changed) {
      can_use_pre_allocated_outputs = true;


We should add a condition that there is no aliased I/O.

I think it's just that the aliased binding cannot be pre-allocated, not that we can't have any pre-allocated bindings.

cehongwang · 2026-06-30T07:45:38Z

+    // tensor reference would lead to writes against stale storage.
+    if (compiled_engine->use_pre_allocated_outputs && !compiled_engine->aliased_io.empty()) {
+      LOG_DEBUG(
+          "Skipping pre_allocated_outputs cache because engine has aliased I/O; "


Do you think we should just skip aliased I/O and only pre-allocate other I/O?

Looks like it is already implemented, right? We can call create_output tensors here
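
One way to read the suggestion in this thread is to keep the pre-allocated output cache and only leave the aliased bindings out of it. The sketch below is illustrative Python, not the runtime's C++ code. `split_preallocatable_outputs` and the binding names are hypothetical; the `aliased_io` layout (output name mapped to an `(input name, kind)` pair) follows the `Dict[str, Tuple[str, str]]` annotation visible in the style diff above.

```python
from typing import Dict, List, Tuple


def split_preallocatable_outputs(
    output_names: List[str],
    aliased_io: Dict[str, Tuple[str, str]],
) -> Tuple[List[str], List[str]]:
    """Partition output bindings into (pre-allocatable, aliased).

    Aliased outputs write into an input's storage, so they are kept out of
    the pre-allocation list; every other output can still be cached.
    """
    preallocate = [name for name in output_names if name not in aliased_io]
    aliased = [name for name in output_names if name in aliased_io]
    return preallocate, aliased


# Hypothetical binding names, for illustration only.
outputs = ["logits", "k_cache_out"]
aliases = {"k_cache_out": ("k_cache", "kv_cache_update")}
preallocate, aliased = split_preallocatable_outputs(outputs, aliases)
```

With this split, an engine that has aliased I/O would not need to disable the pre-allocated output path entirely.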

cehongwang · 2026-06-30T07:52:00Z

+    }
+    auto it = this->aliased_io.find(out_name);
+    if (it == this->aliased_io.end()) {
+      this->aliased_io[out_name] = AliasedIOSpec{std::string(aliased_in), AliasKind::kKVCacheUpdate};


What about kUser?

I think this code only detects IKVCacheUpdate bindings, basically the internal, under-the-hood loopback case. If it's already in the map, then it's a user binding that needs to be returned.

cehongwang · 2026-06-30T08:03:20Z

+The ``IKVCacheUpdateLayer`` performs ``output[i, :, writeIndices[i] + s, :] =
+update[i, :, s, :]`` and aliases ``output`` to ``cache``. Inputs:
+
+* ``cache`` — shape ``[b, d, s_max, h]``, network input, static ``s_max``.


It's better to make the symbol clearer: [b, num_heads, s_max, dim]
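
For readers following the indexing, here is a minimal pure-Python sketch of the update rule quoted above, using the `[b, num_heads, s_max, dim]` naming suggested in this comment. It uses nested lists instead of tensors and is only a reference for the semantics, not how TensorRT implements the layer; `kv_cache_update` is a hypothetical name.

```python
from typing import List

Tensor4D = List[List[List[List[float]]]]


def kv_cache_update(
    cache: Tensor4D, update: Tensor4D, write_indices: List[int]
) -> Tensor4D:
    """Apply output[i, :, write_indices[i] + s, :] = update[i, :, s, :] in place.

    cache has shape [b, num_heads, s_max, dim]; update has shape
    [b, num_heads, s, dim]. The returned object is the same list as `cache`,
    mirroring the output-aliases-input contract.
    """
    for i, start in enumerate(write_indices):
        for head in range(len(cache[i])):
            for s, row in enumerate(update[i][head]):
                cache[i][head][start + s] = list(row)
    return cache


# b=1, num_heads=1, s_max=4, dim=2: write one timestep at position 2.
cache = [[[[0.0, 0.0] for _ in range(4)]]]
update = [[[[1.0, 2.0]]]]
result = kv_cache_update(cache, update, write_indices=[2])
```

Only the written timestep changes; the rest of the cache is untouched, and no second cache-sized buffer is created.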

cehongwang · 2026-06-30T08:07:41Z

+   2. **Source-of-truth reconciliation at deserialize time:** for every output
+      binding, query ``cuda_engine->getAliasedInputTensor(out_name)`` and merge
+      the result into ``aliased_io``. The TRT API is authoritative; the
+      serialized map is a cache that allows the runtime to avoid a per-call


Why per call? We only need to call it once and store it right? Should we consider dropping the metadata for simplicity?
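
The one-time reconciliation being discussed can be sketched as follows. This is illustrative Python rather than the C++ runtime code: `reconcile_aliased_io` is a hypothetical helper, and `query_aliased_input` is a stand-in callable for the engine query that is assumed to return `None` when an output has no alias. Only the "missing entry" branch is visible in the excerpt, so entries already in the serialized map are left untouched here.

```python
from typing import Callable, Dict, List, Optional, Tuple

AliasedIO = Dict[str, Tuple[str, str]]


def reconcile_aliased_io(
    output_names: List[str],
    serialized: AliasedIO,
    query_aliased_input: Callable[[str], Optional[str]],
) -> AliasedIO:
    """Merge engine-reported aliases into the deserialized map, once."""
    merged = dict(serialized)
    for out_name in output_names:
        aliased_in = query_aliased_input(out_name)
        if aliased_in is None:
            continue  # this output is not aliased to any input
        if out_name not in merged:
            # Engine-reported aliases missing from the map are recorded
            # with the kv_cache_update kind.
            merged[out_name] = (aliased_in, "kv_cache_update")
    return merged


# Hypothetical binding names; dict.get plays the role of the engine query.
engine_aliases = {"k_cache_out": "k_cache"}
merged = reconcile_aliased_io(["logits", "k_cache_out"], {}, engine_aliases.get)
```

Because the result is computed once at deserialization and stored, nothing in this sketch has to run per call.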

cehongwang · 2026-06-30T21:34:32Z

+path fire, a pre-compile pass ``lift_mutated_buffers`` runs after
+``ep.module()`` and before ``post_lowering``:
+
+1. Scans for ``aten.copy_.default(get_attr, _)`` patterns — the marker for


Does this have a side effect on .copy() operators themselves, i.e. the ones that are not KV-cache updates?

cehongwang · 2026-06-30T22:12:47Z

+            original_id = id(output)
+
            if output_dtype is not dtype.unknown:
                output = self._cast_output_dtype(


Should we skip casting if it is aliased?

It should be the input type, right?

Is there a case where it is not the input type if it is aliased?

cehongwang · 2026-06-30T22:15:11Z

+    * Input is an FX placeholder (i.e. a network input — required for
+      aliasing).
+    * Input has rank 4 and fully static shape ``[b, d, s_max, h]``.
+    * ``dim`` argument is exactly ``2``.


Why does it have to be 2?

Entries in the cache that get updated by index_put should be rank 2, right?
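
To make the conditions under discussion concrete, the listed checks can be written as a single predicate. This is a hedged sketch, not the converter's validator: `is_kv_fast_path_eligible` is a hypothetical name, dynamic dimensions are assumed to be represented as `None`, and the batch-size check comes from the PR description ("4-D static cache, dim=2, batch=1") rather than from the bullets quoted here.

```python
from typing import Optional, Sequence


def is_kv_fast_path_eligible(
    is_placeholder: bool,
    shape: Sequence[Optional[int]],
    dim: int,
) -> bool:
    """Return True when the KV-cache-update fast path could apply."""
    if not is_placeholder:
        return False  # aliasing needs the cache to be a network input
    if len(shape) != 4:
        return False  # cache layout is [b, num_heads, s_max, dim]
    if any(d is None for d in shape):
        return False  # fully static shape required
    if shape[0] != 1:
        return False  # batch=1, per the PR description
    return dim == 2  # axis 2 is the sequence axis (s_max) in this layout


eligible = is_kv_fast_path_eligible(True, (1, 8, 128, 64), dim=2)
fallback = is_kv_fast_path_eligible(True, (1, 8, 128, 64), dim=1)
```

In the `[b, num_heads, s_max, dim]` layout, axis 2 is the sequence axis that the update layer writes along, which is why the check is on `dim == 2`; anything that fails the predicate takes the scatter fallback.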

narendasan · 2026-06-30T23:52:25Z

+// Encode/decode the aliased_io map. Records are separated by BINDING_DELIM
+// ('%') and each record is "output_name@input_name@kind" (the '@' avoids
+// collision with TRT binding names which are alphanumeric + underscore).
+std::string serialize_aliased_io(const std::unordered_map<std::string, AliasedIOSpec>& aliased_io);


We should fold the binding structure and this together @cehongwang
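
For reference, the record format described in the comment above (records joined by `%`, each record `output_name@input_name@kind`) round-trips as in the sketch below. This is illustrative Python, not the C++ implementation; `deserialize_aliased_io`, `FIELD_DELIM`, the sorted record order, and the example binding names are assumptions made for the example.

```python
from typing import Dict, Tuple

BINDING_DELIM = "%"  # separates records
FIELD_DELIM = "@"    # separates fields inside one record

AliasedIO = Dict[str, Tuple[str, str]]


def serialize_aliased_io(aliased_io: AliasedIO) -> str:
    """Encode {output: (input, kind)} as 'out@in@kind%out@in@kind...'."""
    records = (
        FIELD_DELIM.join((out_name, in_name, kind))
        for out_name, (in_name, kind) in sorted(aliased_io.items())
    )
    return BINDING_DELIM.join(records)


def deserialize_aliased_io(blob: str) -> AliasedIO:
    """Decode the string produced by serialize_aliased_io."""
    result: AliasedIO = {}
    if not blob:
        return result  # an empty string encodes an empty map
    for record in blob.split(BINDING_DELIM):
        out_name, in_name, kind = record.split(FIELD_DELIM)
        result[out_name] = (in_name, kind)
    return result


# Hypothetical binding names, for illustration only.
aliases = {
    "k_cache_out": ("k_cache", "kv_cache_update"),
    "v_cache_out": ("v_cache", "kv_cache_update"),
}
blob = serialize_aliased_io(aliases)
restored = deserialize_aliased_io(blob)
```

The format relies on binding names never containing `%` or `@`, which is the assumption the code comment states.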

narendasan requested a review from apbose May 12, 2026 00:19

meta-cla Bot added the cla signed label May 12, 2026

github-actions Bot requested a review from cehongwang May 12, 2026 00:19

narendasan force-pushed the narendasan/aliased_io branch from 354674d to 813e753 Compare May 12, 2026 00:26

github-actions Bot requested changes May 12, 2026

narendasan force-pushed the narendasan/aliased_io branch from 813e753 to bcaf725 Compare May 12, 2026 20:26

github-actions Bot requested changes May 12, 2026

narendasan mentioned this pull request May 29, 2026

feat: custom binding names for out of runtime deployment #4309

Open

7 tasks

narendasan force-pushed the narendasan/aliased_io branch from bcaf725 to 3afcfd3 Compare June 10, 2026 20:10

github-actions Bot requested changes Jun 10, 2026

narendasan force-pushed the narendasan/aliased_io branch from 3afcfd3 to a8fce13 Compare June 10, 2026 20:14

github-actions Bot requested changes Jun 10, 2026

narendasan force-pushed the narendasan/aliased_io branch from a8fce13 to 94d6ac7 Compare June 17, 2026 22:23

github-actions Bot added the component: build system Issues re: Build system label Jun 17, 2026

github-actions Bot requested changes Jun 17, 2026

narendasan added 2 commits June 22, 2026 17:24

feat: Allow for users / kv cache to add aliased I/O for inplace operations

16bb178

fix: decompose slice scatter to implicitly handle dynamic bounds

e54ed7e

cehongwang force-pushed the narendasan/aliased_io branch from 94d6ac7 to e54ed7e Compare June 22, 2026 17:24

github-actions Bot requested changes Jun 22, 2026

cehongwang reviewed Jun 30, 2026

narendasan commented Jun 30, 2026
