Skip to content

fix(runtime): autosave engine-implicit RuntimeCache via atexit + weakref#4362

Open
tp5uiuc wants to merge 4 commits into
pytorch:mainfrom
tp5uiuc:fix/runtime-cache-del-shutdown
Open

fix(runtime): autosave engine-implicit RuntimeCache via atexit + weakref#4362
tp5uiuc wants to merge 4 commits into
pytorch:mainfrom
tp5uiuc:fix/runtime-cache-del-shutdown

Conversation

@tp5uiuc

@tp5uiuc tp5uiuc commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Summary

Original bug from #4359: engine-implicit RuntimeCache handles relied solely on __del__ to autosave at process exit. When the engine survived to exit, __del__ fired during shutdown and the torchbind / filelock lazy imports raised ImportError: sys.meta_path is None, surfacing as Exception ignored in __del__ once per surviving handle.

Fixes

  • Move the autosave anchor to atexit.register(partial(_autosave_at_exit, weakref.ref(self))). atexit fires before module teardown; __del__
    stays as the mid-program GC path. Both flip autosave_on_del off so
    whichever runs first wins. __del__ unregisters its token to keep the
    atexit registry from accumulating dead entries.
  • Module-level helper closing over a weakref keeps the registration
    non-owning, so handles can still die mid-program normally.
  • RuntimeCache.__getstate__ / __setstate__ strip the unpicklable
    weakref from the pickle stream and re-register a fresh atexit hook
    on unpickle when autosave_on_del was on. __del__ reads the token
    via getattr(..., None) so copy.deepcopy partial-init edge cases
    stay silent.

Dependency on a sibling PR

This PR introduces an _atexit_token slot (partial(..., weakref.ref(self))) on RuntimeCache. CI tests that pickle compiled modules (test_mutable_torchtrt_module::test_save, test_cross_runtime_serde::test_save_*) walk through _runtime_settings.runtime_cache and would hit the weakref. The matching wrapper-side pickle exclusion lives in #4368 and should land first to unblock CI here.

Fixes #4359.

Type of change

  • Bug fix (non-breaking)

Test plan

TestRuntimeCacheAutosave in tests/py/dynamo/runtime/test_000_runtime_cache.py:

  • test_del_swallows_shutdown_import_error_on_pathsys.unraisablehook sees no leak
  • test_atexit_hook_saves_via_weakref — helper saves + flips flag
  • test_atexit_hook_no_op_on_dead_weakref — dead weakref no-ops cleanly
  • test_atexit_token_unregistered_after_delatexit.unregister called with the registered token
  • test_pickle_round_trip_strips_atexit_token — standalone-pickle round-trip rewires a fresh hook (stubbed handle to isolate from the sibling PR's _RuntimeCacheHandle picklability work)
  • pre-commit run clean on touched files (mypy skipped for two pre-existing errors at _TorchTensorRTModule.py:380 and :508, unrelated)

@meta-cla meta-cla Bot added the cla signed label Jun 24, 2026
@github-actions github-actions Bot added component: tests Issues re: Tests component: api [Python] Issues re: Python API labels Jun 24, 2026
@github-actions github-actions Bot requested a review from lanluo-nvidia June 24, 2026 17:21
@tp5uiuc tp5uiuc self-assigned this Jun 24, 2026
@tp5uiuc tp5uiuc marked this pull request as draft June 24, 2026 17:22
@tp5uiuc tp5uiuc force-pushed the fix/runtime-cache-del-shutdown branch from 6d02dad to a0a67d7 Compare June 25, 2026 00:53
@tp5uiuc tp5uiuc changed the title fix(runtime): silence ImportError from RuntimeCache.__del__ fix(runtime): autosave engine-implicit RuntimeCache via atexit + weakref Jun 25, 2026
@tp5uiuc tp5uiuc force-pushed the fix/runtime-cache-del-shutdown branch 2 times, most recently from a3770be to eacb80e Compare June 25, 2026 02:10
@tp5uiuc tp5uiuc marked this pull request as ready for review June 25, 2026 02:12
@tp5uiuc tp5uiuc requested a review from cehongwang June 25, 2026 02:12
@github-actions github-actions Bot added component: core Issues re: The core compiler component: runtime component: dynamo Issues relating to the `torch.compile` or `torch._dynamo.export` paths labels Jun 26, 2026
Comment thread py/torch_tensorrt/runtime/_runtime_cache.py
except Exception:
pass

def __getstate__(self) -> dict[str, Any]:

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers : these are added so that users calling pickle.save(RuntimeCache()) works as expected.

tp5uiuc added 3 commits June 29, 2026 12:02
The previous `__del__`-only autosave path silently lost cache updates when
the engine survived until interpreter exit (typical for inference servers).
Python tears down `sys.meta_path` early in shutdown; the torchbind
attribute access in `self._handle.path` and the lazy `filelock` import
inside `save()` then raise `ImportError: sys.meta_path is None`, which
escaped `__del__` and surfaced as a noisy `Exception ignored in __del__`
once per surviving handle.

`atexit` callbacks run *before* module teardown, so the torchbind path
and lazy imports still resolve there. Register an `atexit` hook from
`__init__` whenever `autosave_on_del=True`. The hook closes over a
`weakref.ref(self)` so it doesn't pin the handle alive: a handle that
dies mid-program still goes through `__del__`, and the atexit hook later
sees a dead weakref and no-ops.

Other design points worth calling out:

* `_autosave_at_exit` is a module-level helper, not a bound method. A
  bound method captures `self` via `__self__`, which would defeat the
  weakref. The free function lets the closure carry only the weakref.

* Both `__del__` and the atexit hook flip `autosave_on_del` off before
  saving so whichever path runs first wins and the other no-ops -- no
  double-save, no double-leak risk.

* `__del__` unregisters its atexit token. Without this, a long-running
  process that churns engine-implicit handles (model swaps, A/B
  rollouts) accumulates dead atexit entries -- small per entry but
  unbounded.

* The `try` in `__del__` still wraps the whole body so any residual
  attribute-access failure during late-shutdown corner cases is swallowed
  rather than leaking to `sys.unraisablehook`.

Tests added in `TestRuntimeCacheAutosave`:
- `test_del_swallows_shutdown_import_error_on_path`: monkey-patches
  `_handle.path` to raise the shutdown `ImportError`; asserts via
  `sys.unraisablehook` that nothing leaks.
- `test_atexit_hook_saves_via_weakref`: exercises the helper directly,
  verifies it saves and flips `autosave_on_del`.
- `test_atexit_hook_no_op_on_dead_weakref`: dead weakref => no-op, no
  exception.
- `test_atexit_token_unregistered_after_del`: `atexit.unregister` is
  spied to confirm `__del__` cleaned up.

Refs pytorch#4359
The atexit + weakref design from the previous commit added an
``_atexit_token`` slot (``partial(_autosave_at_exit, weakref.ref(self))``)
to ``RuntimeCache``. ``weakref.ref`` is unpicklable, so any caller that
pickled a ``RuntimeCache`` directly hit
``TypeError: cannot pickle 'weakref.ReferenceType'``.

Two defenses:

* ``__getstate__`` / ``__setstate__`` strip the token from the pickle
  stream and re-register a fresh atexit hook on unpickle when
  ``autosave_on_del`` was on. Atexit registration is a per-process
  artifact; the loading process gets its own hook bound to the loading-
  side instance.

* ``__del__`` reads ``_atexit_token`` via ``getattr(..., None)``.
  ``__init__`` sets the slot first, but ``copy.deepcopy`` can crash
  mid-state-copy on an unrelated field (e.g. the python-runtime
  ``_RuntimeCacheHandle._lock``, which is ``threading.Lock``) and leave
  the new instance with only some attributes set. The defensive read
  keeps ``__del__`` quiet rather than raising ``AttributeError`` to
  ``sys.unraisablehook``.

Test ``test_pickle_round_trip_strips_atexit_token`` exercises the
standalone-pickle path with a ``SimpleNamespace`` stub for ``_handle``
so the test isolates the ``__getstate__``/``__setstate__`` logic from
the orthogonal limitation that ``_RuntimeCacheHandle`` (python rt) is
itself unpicklable today.

The wrapper-side fix that prevents engine-implicit handles from being
pickled via ``torch.save(module)`` at all -- plus the matching
``_RuntimeCacheHandle`` picklability fix on python-rt -- is intentionally
a separate change; this commit only addresses pickle safety of the
``_atexit_token`` slot introduced in the prior commit.

Refs pytorch#4359
…helper

``__init__`` and ``__setstate__`` both registered the same
``atexit.register(partial(_autosave_at_exit, weakref.ref(self)))``
call. Hoist into ``_register_atexit_autosave`` and reuse from both
sites. The idempotency check (``_atexit_token is None``) moves into the
helper, so a second call is a no-op regardless of caller.

Behavior is unchanged. Addresses review comment on
py/torch_tensorrt/runtime/_runtime_cache.py:227.
@tp5uiuc tp5uiuc force-pushed the fix/runtime-cache-del-shutdown branch from da18dc0 to 6372b2f Compare June 29, 2026 19:02
@tp5uiuc

tp5uiuc commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator Author

[by Claude Code] CI status after rebase on #4367:

Remaining red is environmental noise hitting every concurrent PR (and main):

  • L0 dynamo convertertest_attention_aten::test_dynamic_sdpa_fp_mask_3 (torch-nightly SDPA API)
  • L0 torchscripttest_classes::test_get_layer_info + worker crashes (pre-existing Windows torchscript)
  • Python-only / RTX Python-only dynamo runtimetest_004_weight_streaming flakes

None of those touch files this PR modifies.

@cehongwang cehongwang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments

Comment thread py/torch_tensorrt/runtime/_runtime_cache.py Outdated
Bullet 1 (Engine-implicit) previously claimed __del__ alone writes the
cache; this PR added a second non-owning hook (atexit + weakref) for
the interpreter-shutdown path where __del__ trips ImportError on lazy
imports like filelock. Document both and the "whichever-fires-first
flips autosave off" handshake.

Compact the _handle paragraph and the identity-based-equality note while
in the area.

Addresses an "update the doc here" review comment.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed component: api [Python] Issues re: Python API component: core Issues re: The core compiler component: dynamo Issues relating to the `torch.compile` or `torch._dynamo.export` paths component: runtime component: tests Issues re: Tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

🐛 [Bug] Serialization fails in Runtime Cache

2 participants