[STF] Migrate __stf/allocators/ from cuda_safe_call to cuda_try#9147
Conversation
First in a series migrating production STF headers off the
abort-on-failure ``cuda_safe_call`` onto the throw-on-failure
``cuda_try``, so callers (Python wrappers, exception-aware control flow)
can recover from CUDA errors instead of having the process aborted.
pooled_allocator.cuh:
- ``cudaGetDeviceProperties`` query in ``block_data_pool``'s
constructor: convert to the templated
``cuda_try<cudaGetDeviceProperties>(dev)`` form, which deduces the
first-output substitution and returns the populated struct, so the
variable can be const-initialized. Mark adjacent ``max_mem`` const
as well.
- Leak audit: if the ``cuda_try`` throws, no GPU resources have been
allocated yet (the only ``root_allocator.allocate`` is downstream).
Member subobjects unwound at the throw point have noexcept-clean
destructors (``data_place`` and ``block_allocator_untyped`` are
shared_ptr pimpls). The single call site
(``block_data_pool_set::get_pool``) wraps construction in
``map.emplace``, which is exception-safe.
adapters.cuh:
- ``stream_adapter::clear()`` rewritten to be transactional. The
original for-each over ``to_free`` + lazy sync would, on a thrown
sync, silently abandon the remaining buffers and leave the
``cleared_or_moved`` flag in a contradictory state (either lying
that cleanup succeeded, or firing the destructor's sanity assertion
spuriously). New form pops one buffer at a time, installs a
``SCOPE(exit)`` that frees just that buffer, then syncs lazily. On
throw: the in-flight buffer is freed, ``to_free`` still holds the
remaining pending entries, ``cleared_or_moved`` stays false, and
the caller can catch + retry (or let the destructor's assertion
fire with accurate state). Inter-buffer order is irrelevant (each
``raw_buffer`` is independent), so popping from the back is the
O(1) choice.
- Move-from-back + ``pop_back`` skips one shared_ptr refcount bump
per iteration on ``data_place``. The move is noexcept, so no
half-moved-not-popped risk.
- Marked ``stream`` and the per-iteration ``b`` const.
- Pulled ``scope_guard.cuh`` in explicitly rather than relying on
transitive inclusion.
Pilot PR for the broader migration -- intentionally small (2 sites) so
the conversion patterns (transactional SCOPE-based cleanup, templated
``cuda_try<F>`` form, const-correctness sweep) can be reviewed before
scaling up.
|
/ok to test b81c5df |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughSummary by CodeRabbit
important: ## Walkthrough Stream adapter and pooled allocator update error-handling: stream_adapter::clear() drains pending buffers with lazy stream synchronization and deferred error propagation; pooled allocator constructor queries device properties via cuda_try initialization. ChangesSTF allocator error handling updates
Possibly related PRs
Suggested labels
Suggested reviewers
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
cudax/include/cuda/experimental/__stf/allocators/pooled_allocator.cuh (1)
76-76: ⚡ Quick winsuggestion: qualify the new
cuda_try<cudaGetDeviceProperties>call from the global namespace instead of relying on unqualified lookup in this header. As per coding guidelines, "All calls to free functions must be fully qualified starting from the global namespace, e.g., ::cuda::ceil_div, including calls to functions in the same namespace."
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d250ca80-365d-4dfb-bde7-d7c11380d77c
📒 Files selected for processing (2)
cudax/include/cuda/experimental/__stf/allocators/adapters.cuhcudax/include/cuda/experimental/__stf/allocators/pooled_allocator.cuh
Review bot flagged that the SCOPE(exit) cleanup in the previous version
of clear() calls b.memory_node.deallocate(...), which can throw --
data_place_*::deallocate all use cuda_try internally for cudaFreeHost /
cudaFree / cudaFreeAsync, and the invalid / affine / device_auto
variants throw std::logic_error unconditionally. SCOPE bodies are
noexcept, so a deallocate-throw during unwinding from a sync failure
would call std::terminate.
Restructured to avoid putting deallocate() in a noexcept context.
Capture the sync status (do not throw on the spot), do the
deallocation normally, then surface the captured error via
cuda_try(sync_err) afterwards.
Failure modes:
- Sync ok, deallocate ok: loop continues, cleared_or_moved = true.
- Sync fails, deallocate ok: deallocate runs, cuda_try(sync_err)
throws cuda_exception, to_free holds the rest, cleared_or_moved
stays false. Caller can retry.
- Sync ok, deallocate throws: deallocate's exception propagates,
to_free holds the rest, cleared_or_moved stays false.
- Both fail (correlated -- likely the same sticky CUDA error):
deallocate's exception wins, sync_err is lost. User-visible
diagnostic is equivalent because both reflect the same root cause.
Also dropped the now-unused scope_guard.cuh include.
|
Thanks for the review -- you were right on all counts. Fixed in 65d7e9a. Restructured to avoid putting cudaError_t sync_err = cudaSuccess;
if (!stream_synchronized && !b.memory_node.allocation_is_stream_ordered())
{
sync_err = cudaStreamSynchronize(stream);
if (sync_err == cudaSuccess) { stream_synchronized = true; }
}
b.memory_node.deallocate(b.ptr, b.sz, stream);
cuda_try(sync_err);No SCOPE guards, no noexcept body containing throwing code. The transactional invariant ("on throw, Also dropped the now-unused Side note for a follow-up PR: |
|
/ok to test 65d7e9a |
Rename ``stream_synchronized`` -> ``cudaStreamSynchronize_was_called``
and ``sync_err`` -> ``cudaStreamSynchronize_result``. The new flag name
matches the new semantics: it tracks whether the sync was attempted at
all, not whether it succeeded. Drop the now-dead
``if (sync_err == cudaSuccess) { ... = true; }`` -- on failure the
subsequent ``cuda_try`` throws and exits the loop, so the "retry on
next iteration" branch is unreachable.
Add a short inline comment ahead of the two throwing statements noting
that on throw the loop is left in steady state (``to_free`` accurate,
``cleared_or_moved`` false), and refresh the upstream comment block to
reference the renamed local.
No behavioral change.
|
/ok to test a9d287a |
This comment has been minimized.
This comment has been minimized.
Test ``cudax.test.stf.threads.axpy-threads-graph`` aborts with ``malloc(): smallbin double linked list corrupted`` on H100 with this PR's changes. The CI investigator's leading hypothesis is the ``raw_buffer`` ctor's ``memory_node(mv(memory_node_))`` (slipped in during the move-semantics conversation) interacting badly with the ``clear()`` loop's ``mv(...back()) + pop_back()`` pattern. Code review of both changes does not reveal an obvious double-free, so revert the ctor change and keep the ``clear()`` restructure. If CI then goes green, the ctor mv was the culprit and can be reintroduced more carefully (or left out -- the saving is one shared_ptr refcount bump per emplace, negligible in this code path). If CI still fails, the bug is in the ``clear()`` loop and needs a deeper look. No behavioral change relative to ``main``'s original raw_buffer ctor.
|
Bisect: reverted the 5200237 |
|
/ok to test 5200237 |
|
Actionable comments posted: 0 |
This comment has been minimized.
This comment has been minimized.
|
#9186 hopefully deal with these concurrency issues |
|
/ok to test b772be7 |
|
/ok to test 7273e6a |
This comment has been minimized.
This comment has been minimized.
|
/ok to test 8a3e21b |
This comment has been minimized.
This comment has been minimized.
|
/ok to test a5af2e6 |
This comment has been minimized.
This comment has been minimized.
|
/ok to test |
@andralex, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/ |
|
/ok to test a13bdb9 |
🥳 CI Workflow Results🟩 Finished in 1h 45m: Pass: 100%/55 | Total: 1d 09h | Max: 1h 03m | Hits: 13%/191739See results here. |
Summary
First PR in a series migrating production STF headers off the abort-on-failure
cuda_safe_calland onto the throw-on-failurecuda_try, so callers (Python wrappers, any exception-aware control flow) can recover from CUDA errors instead of having the process aborted underneath them.Intentionally small -- two call sites in
cudax/include/cuda/experimental/__stf/allocators/-- so the conversion patterns can be reviewed before scaling up.Changes
pooled_allocator.cuhblock_data_pool's constructor usedcudaGetDevicePropertiesviacuda_safe_call. Now uses the templatedcuda_try<F>(args...)form -- which deduces the first-output substitution and returns the populated struct -- so the variable can be const-initialized:Leak audit on throw:
root_allocator.allocate(...)).data_placeandblock_allocator_untypedare shared_ptr pimpls).map.emplace(...), which is exception-safe.adapters.cuhstream_adapter::clear()was a for-each over theto_freevector with a lazy sync before the first blocking deallocation. Undercuda_safe_calla sync failure just aborted; converting tocuda_tryexposed the lack of exception safety -- on a thrown sync, the unprocessed buffers would have been silently abandoned, and the destructor's_CCCL_ASSERT(cleared_or_moved, ...)would either lie or fire spuriously.Rewritten to be transactional:
On a throw from
cudaStreamSynchronize:SCOPE(exit).to_freeaccurately holds the remaining un-deallocated entries.cleared_or_movedstaysfalse.clear(), or let the destructor's assertion fire with truthful state.Inter-buffer deallocation order doesn't matter (each
raw_bufferis independent), so popping from the back is the natural O(1) choice.mv(...back())+pop_back()skips one shared_ptr refcount bump ondata_placeper iteration; the move is noexcept, so no half-moved-not-popped risk.Explicit
#include <cuda/experimental/__stf/utility/scope_guard.cuh>added rather than relying on transitive inclusion.Migration pattern notes (for follow-ups)
cuda_trysubstitution. Prefer the templatedcuda_try<F>(args...)form when the function has an out-parameter, so the result can be const-initialized.cuda_try+SCOPE(fail)for the rollback. Inside theSCOPE(fail)body, usecuda_safe_call, notcuda_try-- guard destructors arenoexcept, so a thrown exception during unwinding wouldstd::terminate.clear()pattern): when looping over state to release, pop incrementally and use per-iterationSCOPE(exit)so the in-flight item is always freed and the remaining queue stays accurate on throw.cuda_safe_call. Same rationale -- those contexts arenoexcept.SCOPE(fail)rollback in the constructor body.Test plan
/ok to testtriggered in comment below)